STET
Pinned W1→W2 comparison

Zod (TypeScript) · 28-task shared slice

Zod (TypeScript) is tightly contested: GPT-5.4 edges out GPT-5.3 Codex on pass rate, 79% vs 75%. But quality separates the field: Claude Opus 4.6 leads equivalence at 67%, while GPT-5.4 costs 13x less per task.

Updated Mar 19, 2026 · 435 runs · 3 repos · Judge: GPT-5.3 Codex
Week 3 · 28 tasks per model

- The test-based pass/fail bar: 78.6% (GPT-5.4)
- Match the intended fix? 66.7% (Claude Opus 4.6; all tasks: 39.3%)
- Would a reviewer approve? 58.3% (Claude Opus 4.6; all tasks: 32.1%)
- Surgical or over-edited? 0.0% over-edited (Claude Opus 4.6, lowest)
- API spend per task: $1.99 (GPT-5.4)
| Model (harness) | Pass rate, n=28 | Intended-fix match (all) | Reviewer approval (all) | Over-edited (all) | Spend/task |
| --- | --- | --- | --- | --- | --- |
| GPT-5.4 (codex cli) | 78.6% (0.0pp) | 45.5% (39.3%) | 31.8% (25.0%) | 18.2% (28.6%) | $1.99 |
| GPT-5.3 Codex (codex cli) | 75.0% (0.0pp) | 42.9% (35.7%) | 19.0% (14.3%) | 28.6% (32.1%) | $16.90 |
| GPT-5.1 Codex Mini (codex cli) | 75.0% (0.0pp) | 19.0% (14.3%) | 9.5% (7.1%) | 47.6% (50.0%) | $5.55 |
| Claude Opus 4.6 (claude code)* | 42.9% (new) | 66.7% (39.3%) | 58.3% (32.1%) | 0.0% (3.6%) | $26.80 |
| GPT-5.4 Mini (codex cli)* | 32.1% (new) | 11.1% (14.3%) | 16.7% (18.2%) | 77.8% (60.7%) | $3.34 |

All percentages are on the 28-task shared slice. Parentheses show the week-over-week change for pass rate and the rate across all tasks for the other metrics.

\* Equivalence and code review for these runs were graded with GPT-5.4 and are not directly comparable to prior weeks.