On the Efficacy of Using Large Language Models for Automatic Grading of CS Theory Problems
Automating the grading of theoretical computer science (TCS) problems has the potential to save instructors significant time while providing students with consistent feedback. This poster explores the efficacy of a large language model (LLM) in automatically grading undergraduate-level TCS problems. We focus on common topics such as creating context-free grammars and pushdown automata, converting grammars to Chomsky Normal Form, designing Turing machines, and proving languages undecidable via reductions. For each problem type, we developed a rubric and generated a dataset of student-like solutions, some correct and others with typical student errors. The LLM was tasked with scoring these solutions according to the rubric, and its grades were compared to human instructor grades. Our results show that the LLM can often mimic human grading trends with moderate accuracy, but notable discrepancies arise in specific rubric categories. In particular, the LLM tends to over-penalize formatting and notation issues and sometimes overlooks deeper logical errors.
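As a minimal sketch of how the per-category comparison between LLM and instructor grades could be computed, the snippet below averages the signed point difference (LLM minus human) within each rubric category. The function name `per_category_bias`, the record layout, the category labels, and the scores are all hypothetical and chosen only for illustration; they are not the rubric, data, or analysis used in this work.

```python
from collections import defaultdict
from statistics import mean


def per_category_bias(records):
    """Return the mean signed difference (LLM - human) per rubric category.

    `records` is an iterable of (category, human_points, llm_points) tuples.
    This layout is an assumption made for this sketch.
    A negative value means the LLM awarded fewer points than the human
    grader on average (harsher); a positive value means it awarded more
    (more lenient).
    """
    diffs = defaultdict(list)
    for category, human_points, llm_points in records:
        diffs[category].append(llm_points - human_points)
    return {category: mean(values) for category, values in diffs.items()}


# Made-up scores for illustration only; not results from the poster.
example_records = [
    ("notation", 2, 1),
    ("notation", 2, 1),
    ("correctness", 3, 5),
    ("correctness", 4, 4),
]

bias = per_category_bias(example_records)
for category, value in sorted(bias.items()):
    print(f"{category}: {value:+}")
```

Under this convention, a consistently negative value in a notation-style category would correspond to over-penalizing formatting, and a positive value in a correctness-style category would correspond to missed logical errors. A signed mean can hide errors that cancel out, so a mean absolute difference may be worth reporting alongside it.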