Examining Students’ Code Comprehension with LLMs in Block- and Text-Based Programming
Understanding how students reason about code is essential for providing tailored scaffolding in computer science (CS) education. Prior work has used think-aloud protocols with the Structure of the Observed Learning Outcome (SOLO) taxonomy to examine students' code comprehension and programming levels. However, analyzing such data is labor-intensive and requires expert judgment. Recent advances in large language models (LLMs) offer a promising avenue for scaling this analysis, though their reliability for fine-grained coding remains uncertain. To address this gap, our study investigates the extent to which GPT-5 and GPT-4o can classify SOLO levels and identify code-comprehension strategies from think-aloud transcripts of 27 high-school students working on block-based and text-based tasks. Results show modest alignment with human ratings for SOLO, with one-shot prompting improving agreement over zero-shot, though distinctions between adjacent lower levels (e.g., Prestructural 1 vs. 2) remained difficult. Strategy detection demonstrated stronger performance, achieving accuracies of 75–77% (block) and 62–67% (text), particularly for surface-visible strategies such as 'walkthroughs', 'control-structure identification', and 'pattern recognition', but weaker for less frequent, abstract, meta-cognitive strategies such as 'strategizing' (planning an approach) or 'thoroughness' (systematically checking work). These findings highlight both the potential and the limitations of using GPT-5 and GPT-4o to analyze think-aloud data. Although this work represents an initial step, our preliminary results indicate that a human-in-the-loop approach is essential to ensure reliability and interpretive depth. Future work will extend this evaluation to other LLMs to better understand their role in supporting instructional decision-making.