Aligning Small Language Models for Programming Feedback: Towards Scalable Coding Support in a Massive Global Course
Providing timely and actionable feedback is essential for students learning to program. While large language models (LLMs) are increasingly used to automate this process, they remain costly to deploy and raise concerns around privacy and institutional control. Small language models (SLMs) offer a promising alternative: they can be run locally and integrated more flexibly into educational platforms. However, their out-of-the-box performance is often poor, requiring targeted training to be effective in classrooms. In this paper, we investigate whether a trained 3B-parameter SLM, guided by rubric-based prompting and a pipeline combining supervised and reinforcement learning, can generate diagnostic feedback that approaches the quality of larger models. We deploy the model in a large-scale online programming course and compare its feedback to that of its base and fine-tuned variants, LLaMA-3.1-8B, and GPT-4.1, using human ratings from 53 teaching assistants and an automated LLM-as-a-judge analysis. Our results show that careful training narrows the feedback quality gap between an SLM and an LLM from over 80 to just 10 percentage points on key metrics. The trained SLM rarely hallucinates errors, is often rated as helpful by educators, and only occasionally misses issues in student code. These findings suggest that small models can serve as practical, scalable solutions for targeted feedback in large educational settings, while LLMs may remain necessary for more comprehensive diagnostic feedback.
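As a concrete illustration of the rubric-based prompting mentioned above, the sketch below shows how a locally hosted ~3B instruction-tuned SLM could be prompted to produce diagnostic feedback on a student submission. This is a minimal sketch, not the authors' released pipeline: the model identifier, rubric criteria, and helper names are illustrative placeholders.

```python
# Illustrative sketch of rubric-based prompting with a locally hosted SLM.
# Model name and rubric items below are placeholders, not the paper's artifacts.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct"  # placeholder ~3B instruct model

RUBRIC = [
    "Point out only errors that are actually present in the code.",
    "Focus on the most important issue rather than listing everything.",
    "Give a hint toward the fix; do not reveal a full corrected solution.",
    "Keep the feedback concise and encouraging.",
]


def build_prompt(problem: str, student_code: str) -> str:
    """Assemble a rubric-guided prompt for diagnostic feedback."""
    rubric_text = "\n".join(f"- {item}" for item in RUBRIC)
    return (
        "You are a programming tutor. Follow this rubric when giving feedback:\n"
        f"{rubric_text}\n\n"
        f"Problem description:\n{problem}\n\n"
        f"Student submission:\n{student_code}\n\n"
        "Feedback:"
    )


def generate_feedback(problem: str, student_code: str) -> str:
    """Run the rubric-guided prompt through the local SLM and return its feedback."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")
    messages = [{"role": "user", "content": build_prompt(problem, student_code)}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
    # Decode only the newly generated tokens (the feedback text).
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
```

This covers only the inference side; in the setup described in the abstract, the supervised and reinforcement learning stages would be applied to the SLM before it is prompted this way, and the resulting feedback would be evaluated by teaching assistants and an LLM-as-a-judge.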