Examining Students’ Code Comprehension with LLMs in Block- and Text-Based Programming
Understanding how students reason about code is essential for providing tailored scaffolding in computer science (CS) education. Prior work has used think-aloud protocols with the Structure of the Observed Learning Outcomes (SOLO) taxonomy to examine students’ code comprehension and programming levels. However, analyzing such data is labor-intensive and requires expert judgment. Recent advances in large language models (LLMs) offer a promising avenue for scaling this analysis, though their reliability for fine-grained coding remains uncertain. To address this gap, our study investigates the extent to which GPT-5 and GPT-4o can classify SOLO levels and identify code-comprehension strategies from think-aloud transcripts of 27 high-school students working on block-based and text-based tasks. Results show modest alignment with human ratings for SOLO, with one-shot prompting improving agreement over zero-shot, though distinctions between adjacent lower levels (e.g., Prestructural 1 vs. 2) remained difficult. Strategy detection demonstrated stronger performance, achieving accuracies of 75–77% (block) and 62–67% (text), particularly for surface-visible strategies such as ‘walkthroughs’, ‘control-structure identification’, and ‘pattern recognition’, but weaker for less frequent, abstract, meta-cognitive strategies such as ‘strategizing’ (planning an approach) or ‘thoroughness’ (systematically checking work). These findings highlight both the potential and the limitations of using GPT-5 and GPT-4o to analyze think-aloud data. While this work represents an initial step, our preliminary results indicate that a human-in-the-loop approach is essential to ensure reliability and interpretive depth. Future work will extend this evaluation to other LLMs to better understand their role in supporting instructional decision-making.