Exploring the Use of LLMs for Assessing Creativity in Student Programming Artifacts
Creativity is a critical learning outcome in K–12 computer science, yet assessing it at scale remains challenging. Human-scored approaches, such as the Consensual Assessment Technique (CAT), are resource-intensive and prone to rater variability. Building on advances in large language models (LLMs), this study investigates whether GPT-4o can reliably assess creativity in student-generated code. We collected 383 flow-based music programs from 194 upper-elementary students (ages 10–12) between 2022 and 2024. Five human experts rated each artifact on four dimensions: originality, complexity, efficiency, and emotional expressiveness. We evaluated three prompting strategies: zero-shot, few-shot with theory-driven exemplars (ECD), and few-shot with human-selected examples. Of these, ECD-based few-shot prompting performed best, achieving the lowest mean absolute error (MAE = 0.582) and the highest agreement within ±1.0 of human scores (82.4%). Zero-shot prompting, while slightly less accurate, achieved the highest correlation with human scores (Spearman’s ρ = 0.53), suggesting its potential for lightweight deployment. In contrast, few-shot prompting with human-selected examples introduced more scoring bias. Compared to a static rule-based baseline, all LLM strategies significantly improved prediction accuracy without requiring handcrafted features. These findings highlight LLMs’ potential as scalable, interpretable tools for creativity assessment in block-based programming. We release our anonymized dataset, prompt templates, and code to encourage replication and further research.
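The three agreement metrics reported above (MAE, the share of scores within ±1.0 of the human rating, and Spearman's ρ) can be illustrated with a minimal sketch. This is not the study's released code: the function names and the five toy score pairs are hypothetical, chosen only to show how each metric is computed from paired human and model scores.

```python
from statistics import mean


def mean_absolute_error(human, model):
    """Average absolute gap between paired human and model scores."""
    return mean(abs(h - m) for h, m in zip(human, model))


def agreement_within(human, model, tolerance=1.0):
    """Fraction of model scores that fall within +/- tolerance of the human score."""
    hits = sum(1 for h, m in zip(human, model) if abs(h - m) <= tolerance)
    return hits / len(human)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(human, model):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rh, rm = average_ranks(human), average_ranks(model)
    mh, mm = mean(rh), mean(rm)
    cov = sum((a - mh) * (b - mm) for a, b in zip(rh, rm))
    var_h = sum((a - mh) ** 2 for a in rh)
    var_m = sum((b - mm) ** 2 for b in rm)
    return cov / (var_h * var_m) ** 0.5


# Hypothetical toy scores, not data from the study.
human_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
model_scores = [1.5, 2.0, 4.5, 3.5, 5.0]

print(mean_absolute_error(human_scores, model_scores))
print(agreement_within(human_scores, model_scores))
print(spearman_rho(human_scores, model_scores))
```

MAE and the ±1.0 agreement rate measure how close the model's scores are to the human scores in absolute terms, while Spearman's ρ measures only whether the two sets of scores order the artifacts similarly. That difference is why one prompting strategy can lead on correlation while another leads on error.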