Providing timely and actionable feedback is essential for students learning to program. While large language models (LLMs) are increasingly used to automate this process, they remain costly to deploy and raise concerns around privacy and institutional control. Small language models (SLMs) offer a promising alternative: they can be run locally and integrated more flexibly into educational platforms. However, their out-of-the-box performance is often poor, requiring targeted training to be effective in classrooms. In this paper, we investigate whether a trained 3B-parameter SLM, guided by rubric-based prompting and a pipeline combining supervised and reinforcement learning, can generate diagnostic feedback that approaches the quality of larger models. We deploy the model in a large-scale online programming course and compare its feedback to that of its base and fine-tuned variants, LLaMA-3.1-8B, and GPT-4.1, using human ratings from 53 teaching assistants and an automated LLM-as-a-judge analysis. Our results show that careful training narrows the feedback-quality gap between an SLM and an LLM from over 80 to just 10 percentage points on key metrics. The trained SLM hallucinates errors less often, is frequently rated as helpful by educators, and only occasionally misses issues in student code. These findings suggest that small models can serve as practical, scalable solutions for targeted feedback in large educational settings, while LLMs may remain necessary for more comprehensive diagnostic feedback.

Fri 20 Feb

Displayed time zone: Central Time (US & Canada)

10:40 - 12:00
Scaling Feedback and Assessments Without Losing Your Sanity (or Your Servers) (Papers, Meeting Room 100)
Chair(s): Ross T. Sowell (Sewanee)
10:40 | 20m | Talk (Papers)
Aligning Small Language Models for Programming Feedback: Towards Scalable Coding Support in a Massive Global Course
Charles Koutcheme (Aalto University), Juliette Woodrow (Stanford University), Chris Piech (Stanford University)
11:00 | 20m | Talk (Papers)
Designing and Implementing Skill Tests at Scale: Frequent, Computer-Based, Proctored Assessments with Minimal Infrastructure Requirements
Anastasiya Markova, Anish Kasam, Bryce Hackel, Marina Langlois, Sam Lau (all University of California San Diego)
11:20 | 20m | Talk (Papers)
Scaling Engagement: Leveraging Social Annotation and AI for Collaborative Code Review in Large CS Courses
Raymond Klefstad (University of California, Irvine), Susan Klefstad (Independent Researcher), Vincent Tran (University of California, Irvine), Michael Shindler (University of California, Irvine)