Benchmarking AI Tools for Software Engineering Education: Insights into Design, Implementation, and Testing
As generative AI (GenAI) tools reshape software engineering (SE) workflows, educators are exploring how to integrate them meaningfully into computing education. This experience report presents a structured benchmarking study of widely used AI tools (including GitHub Copilot, GPT-4, Codeium, Claude 3.5, Gemini 1.5, Supermaven, TabNine, Testim, Postman, Eraser.io, and Lucidchart AI) across key SE phases: design, implementation, debugging, and testing. Tools were selected for their industry relevance, accessibility to students, and alignment with common SE tasks. Through controlled experiments conducted by five AI-experienced evaluators with matched exposure levels, we assessed tool performance using standardized prompts, counterbalanced task roles, and a range of proxy metrics (prompt iterations, task completion time, human correction burden, hallucination frequency, output accuracy, and cross-file consistency) chosen to capture both cognitive load and tool limitations. While the AI tools accelerated tasks such as boilerplate generation and UML sketching, they struggled with test coverage quality, cross-file coherence, and reliability under complex prompts. We discuss educational implications, including managing cognitive load, matching tools to task types, and explicitly teaching prompt refinement and verification strategies. The paper offers actionable guidance for instructors, curriculum-ready artifacts, and a roadmap for scaling AI integration in SE classrooms, while noting key limitations to support replication and contextual adoption.