Improving the Reliability of Grading Written-Response Coding Questions in a Large CS1 Course
Fair and consistent assessment of student learning is critical in educational settings, particularly when evaluating the impact of instructional innovations. Although widely used for efficiency, output-based auto-grading often falls short of capturing partial understanding, which limits its effectiveness for measuring learning gains. This paper presents an empirical evaluation of a rubric-based, question-focused, double-grading protocol for written-response (WR) coding questions in pre- and post-tests from a large introductory programming (CS1) course. This work provides both methodological insights and practical guidance for scaling reliable grading of open-ended coding questions.
To balance efficiency and accuracy, each grader scored a single question across all submissions, with two graders assigned per question. Adjudication was triggered when score differences exceeded a 20% threshold. Intraclass correlation coefficient (ICC) analysis identified two questions with low inter-rater reliability. After rubric clarification and regrading, reliability improved substantially, with ICC values ranging from 0.892 to 0.967 (all data) and 0.831 to 0.875 (excluding zero scores).
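The two mechanical steps of this protocol, flagging grader disagreements for adjudication and computing a per-question ICC, can be expressed compactly. The sketch below is illustrative rather than the authors' actual pipeline; the column names, the use of pingouin's ICC2 (single-rater, absolute-agreement) form, and the reading of the threshold as 20% of a question's maximum score are assumptions.

```python
# Minimal sketch (not the paper's pipeline) of adjudication flagging and
# per-question ICC over double-graded scores. Column names and the 20%-of-max
# interpretation of the threshold are assumptions.
import pandas as pd
import pingouin as pg

def flag_for_adjudication(df: pd.DataFrame, max_score: float,
                          threshold: float = 0.20) -> pd.DataFrame:
    """Return submissions whose two grades differ by more than
    threshold * max_score, i.e. the cases sent to adjudication."""
    wide = df.pivot(index="submission_id", columns="grader", values="score")
    gap = (wide.iloc[:, 0] - wide.iloc[:, 1]).abs()
    return wide[gap > threshold * max_score]

def question_icc(df: pd.DataFrame) -> float:
    """Single-rater, absolute-agreement ICC (ICC2) for one question;
    low values suggest the rubric needs clarification and regrading."""
    icc = pg.intraclass_corr(
        data=df, targets="submission_id", raters="grader", ratings="score"
    )
    return icc.loc[icc["Type"] == "ICC2", "ICC"].iloc[0]

# Hypothetical data: two graders, one question scored out of 10.
scores = pd.DataFrame({
    "submission_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "grader":        ["A", "B"] * 5,
    "score":         [8.0, 7.5, 4.0, 7.0, 9.0, 9.0, 6.0, 5.5, 10.0, 9.5],
})
print(flag_for_adjudication(scores, max_score=10))  # submission 2 exceeds the 20% gap
print(f"ICC2 = {question_icc(scores):.3f}")
```

In practice, the threshold definition and the ICC variant (single vs. average measures, consistency vs. absolute agreement) should match whatever the grading protocol specifies.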
We describe the iterative development of the assessment process and show how this structured approach, combined with ICC analysis as a diagnostic tool and targeted adjudication, achieves strong inter-rater reliability. The framework is scalable and robust for evaluating WR coding questions in CS1 settings and is adaptable to a range of instructional contexts. These findings support instructors and researchers seeking consistent, practical methods for assessing open-ended student work in programming courses.