CSTutorBench: Benchmarking Large Language Models for Realistic Computer Science Tutoring
Large Language Models (LLMs) show promise for computer science (CS) educational assistance; however, the absence of comprehensive benchmarks limits our ability to accurately assess their effectiveness in real-world teaching scenarios. To fill this gap, we present CSTutorBench, a dataset of 2,970 multimodal question–answer pairs drawn from authentic course discussion forums. We further propose an evaluation framework spanning five dimensions (accuracy, clarity, conciseness, personalization, and interactivity) to gauge model performance in tutoring settings. We benchmark leading LLMs, including GPT-4o, Claude, Llama 4, and others, using both automated metrics and expert human assessment. Across these real CS tutoring exchanges, we find that leading LLMs approach human performance in accuracy and clarity but fall notably short on personalization and interactive scaffolding, often producing fluent yet less actionable help.