STET
Pinned W1→W2 comparison

Zod (TypeScript) · 28-task shared slice

Zod (TypeScript) is tightly contested: GPT-5.4 edges out GPT-5.3 Codex on pass rate, 79% vs 75%. But quality separates the field: Claude Opus 4.6 leads equivalence at 67%, while GPT-5.4 costs 13x less per task.

Updated Mar 19, 2026 · 435 runs · 3 repos · Judge: GPT-5.3 Codex
Week 3 · 28 tasks per model

- The test-based pass/fail bar: 78.6% (GPT-5.4)
- Match the intended fix? 66.7% (Claude Opus 4.6; all tasks: 39.3%)
- Would a reviewer approve? 58.3% (Claude Opus 4.6; all tasks: 32.1%)
- Surgical or over-edited? 0.0% over-edited (Claude Opus 4.6, lowest)
- API spend per task: $1.99 (GPT-5.4)
| Model (harness) | Pass rate, n=28 | Intended-fix match (all) | Reviewer approval (all) | Over-edited (all) | Spend/task |
| --- | --- | --- | --- | --- | --- |
| GPT-5.4 (codex cli) | 78.6% (0.0pp) | 45.5% (39.3%) | 31.8% (25.0%) | 18.2% (28.6%) | $1.99 |
| GPT-5.3 Codex (codex cli) | 75.0% (0.0pp) | 42.9% (35.7%) | 19.0% (14.3%) | 28.6% (32.1%) | $16.90 |
| GPT-5.1 Codex Mini (codex cli) | 75.0% (0.0pp) | 19.0% (14.3%) | 9.5% (7.1%) | 47.6% (50.0%) | $5.55 |
| Claude Opus 4.6 (claude code)* | 42.9% (new) | 66.7% (39.3%) | 58.3% (32.1%) | 0.0% (3.6%) | $26.80 |
| GPT-5.4 Mini (codex cli)* | 32.1% (new) | 11.1% (14.3%) | 16.7% (18.2%) | 77.8% (60.7%) | $3.34 |

All percentages are on the 28-task shared slice. Parentheses show the week-over-week change for pass rate and the rate across all tasks for the other metrics.

\* Equivalence and code review for these runs were graded with GPT-5.4 and are not directly comparable to prior weeks.