A Verification-First, Self-Healing Framework for LLM-Enabled Generation of CS1 Exercises
Large CS1 courses routinely need several versions of the same idea, such as practice items, make-ups, and multi-form exams that target the same learning objective. Manually authoring these isomorphic items is slow, and one-shot automatic generation often drifts off concept, shifts difficulty, or produces code that does not run. We address this problem with a self-healing framework for generating CS1 exercise variants. The design is generator-agnostic. In our framework, we use three separate role instances of a Large Language Model (LLM): a Generator that proposes candidate items, an Evaluator that checks them against constraints extracted from the base problem, and a Solver that produces a reference solution. Acceptance is determined only by executable docstring tests (Python doctests). When tests fail, a lightweight controller turns failure traces into targeted repairs and retries. In short, generation proposes and tests decide.
We applied the framework to CS1 programming problems and report preliminary evidence on feasibility and classroom fit. Across the items we generated, concept fidelity held, perceived difficulty remained largely stable relative to the base problems, and reference solutions were usually correct, with a minority that were correct but too terse for instructional use. Executable verification improved after the repair loop, which closed most initial failures while maintaining alignment with the original learning objectives. These results suggest that a verification-first workflow can turn fast, fallible generation into classroom-ready items while preserving pedagogical intent.