Generative AI (GenAI) now pervades computer science education, but its pedagogical value depends on how it is integrated. This study explores whether evaluating AI-generated solutions can be as effective as solving problems directly. In a large upper-division algorithms course, we conducted a twelve-week randomized A/B crossover study ($N=220$) in which students alternated between grading ChatGPT’s answers and solving comparable problems themselves. Across exams and overall course grades, performance was statistically indistinguishable between conditions. Homework differences favored whichever cohort encountered the easier half of the syllabus, suggesting that task difficulty, rather than the evaluation activity, drove those differences. Surveys showed neutral-to-positive perceptions, with students who reported changing their study habits rating the GenAI-evaluation activity as more helpful. We discuss design choices for GenAI-evaluation tasks that better elicit metacognition without harming achievement.