With the rapid surge of generative AI, many tools, such as Google’s Gemini and OpenAI’s GPT-4, have been introduced with the well-intentioned goal of supporting programmers [5, 8]. Professional programmers can use these tools to write code more efficiently and to support debugging and testing; however, we recently began to notice a growing number of novice programmers who become highly dependent on Large Language Models (LLMs) to write code for them rather than using LLMs as a learning tool [2, 10]. In our CS1 course, approximately 10-15% of the students (out of roughly 350) were cited for academic misconduct due to direct plagiarism from LLMs, and many of these students performed poorly due to an over-reliance on generative AI.
To address this, we developed a machine learning-based tool to detect AI-generated code. The tool was built on a dataset of thousands of student submissions (in C++) from introductory programming courses, paired with an equal number of AI-generated solutions produced from carefully curated prompts for the same assignments. We trained traditional ML models (Random Forest, XGBoost, etc.) on this labeled dataset, and our best-performing model achieved high precision and recall. Notably, the models remained robust even when the training data was noisy and included AI-generated samples. Our goal is to provide the community with a model that can be customized to any course or program, enabling early detection of LLM-plagiarized code and timely intervention with students.
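The text above does not specify a feature representation or training pipeline, so the following is only a minimal sketch of the general approach, under stated assumptions: each sample is raw C++ source text labeled 0 (student-written) or 1 (AI-generated), features are character n-gram TF-IDF vectors, and the classifier is scikit-learn's RandomForestClassifier. The feature choice and all parameter values are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a traditional-ML detector for AI-generated code.
# Assumptions (not from the source): raw C++ source strings as input,
# binary labels (0 = student-written, 1 = AI-generated), and
# character n-gram TF-IDF features.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline


def train_detector(sources, labels):
    """Train a detector on (C++ source, label) pairs; returns the fitted pipeline."""
    X_train, X_test, y_train, y_test = train_test_split(
        sources, labels, test_size=0.2, stratify=labels, random_state=0
    )
    model = Pipeline([
        # Character n-grams are somewhat robust to identifier renaming
        # and whitespace edits, which matters for student code.
        ("tfidf", TfidfVectorizer(analyzer="char_wb",
                                  ngram_range=(3, 5),
                                  max_features=20000)),
        ("clf", RandomForestClassifier(n_estimators=300, random_state=0)),
    ])
    model.fit(X_train, y_train)
    # Report per-class precision and recall, the metrics cited in the text.
    print(classification_report(y_test, model.predict(X_test),
                                target_names=["student", "ai_generated"]))
    return model
```

Because XGBoost's `xgboost.XGBClassifier` follows the same fit/predict interface, swapping it in for the final pipeline step would cover the XGBoost variant mentioned above, and retraining the pipeline on a course's own submissions is what per-course customization would amount to under this design.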