STET

Opus 4.7 Low vs Medium vs High vs Xhigh vs Max: The Reasoning Curve on 29 Real Tasks from an Open Source Repo

May 12, 2026

I ran Opus 4.7 in Claude Code at all reasoning effort settings (low, medium, high, xhigh, and max) on the same 29 tasks from an open source repo (GraphQL-go-tools, in Go).

On this slice, Opus 4.7 did not behave as if more reasoning effort translated linearly into more intelligence. In fact, the curve appears to peak at medium.

If you think this is weird, I agree! This was the follow-up to a Zod run where Opus also looked non-monotonic. I reran the question on GraphQL-go-tools because I wanted a more discriminating repo slice and didn't yet trust the finding that more reasoning != better outcomes. Running on the GraphQL repo helped clarify the result: Opus still did not show a simple higher-reasoning-is-better curve.

The contrast is GPT-5.5 in Codex, which overall did show the intuitive curve: more reasoning bought more semantic/review quality. That post is here: https://www.stet.sh/blog/gpt-55-codex-graphql-reasoning-curve

Medium has the best test pass rate, the highest equivalence with the original human-authored changes, the best code-review pass rate, and the best aggregate craft/discipline score. Low is cheaper and faster, but it drops too much correctness. High, xhigh, and max spend more time and money without beating medium on the metrics that matter.

More reasoning effort doesn't only cost more - it changes the way Claude works, without reliably improving judgment. Xhigh inflates the test/fixture surface the most. Max is busier overall and has the largest implementation-line footprint. But even though both are supposedly thinking more, neither produces "better" patches than medium.

One likely reason: Opus 4.7 uses adaptive thinking - the model already picks its own reasoning budget per task, so the effort knob biases an already-adaptive policy rather than buying more intelligence. More on this below.

An illuminating example is PR #1260. Medium, after a retry, recovered and produced a real patch. High and xhigh used their extra reasoning budget to dig up commit hashes from prior PRs and confidently declare "no work needed" - voluntarily ending the turn with no patch. Medium and max read the literal control flow and made the fix.

One broader takeaway for me: this should not have to be a one-off manual benchmark. If reasoning level changes the kind of patch an agent writes, the natural next step is to let the agent test and improve its own setup on real repo work.

For this post, "equivalent" means the patch matched the intent of the merged human PR; "code-review pass" means an AI reviewer judged it acceptable; craft/discipline is a 0-4 maintainability/style rubric; footprint risk is how much extra code the agent touched relative to the human patch.

The data:

| Metric | Low | Medium | High | Xhigh | Max |
|---|---|---|---|---|---|
| All-task pass | 23/29 | 28/29 | 26/29 | 25/29 | 27/29 |
| Equivalent | 10/29 | 14/29 | 12/29 | 11/29 | 13/29 |
| Code-review pass | 5/29 | 10/29 | 7/29 | 4/29 | 8/29 |
| Code-review rubric mean | 2.426 | 2.716 | 2.509 | 2.482 | 2.431 |
| Footprint risk mean | 0.155 | 0.189 | 0.206 | 0.238 | 0.227 |
| All custom graders | 2.598 | 2.759 | 2.670 | 2.669 | 2.690 |
| Mean cost/task | $2.50 | $3.15 | $5.01 | $6.51 | $8.84 |
| Mean duration/task | 383.8s | 450.7s | 716.4s | 803.8s | 996.9s |
| Equivalent passes per dollar | 0.138 | 0.153 | 0.083 | 0.058 | 0.051 |

Claude Opus 4.7 on GraphQL-go-tools: medium peaks across pass rate (28/29), equivalence (48%), review pass (35%), and aggregate craft/discipline. High, xhigh, and max each cost more without beating medium on any primary quality metric. The curve is non-monotonic, unlike the GPT-5.5 Codex run on the same repo.

Outcomes

Per-task drilldown - sorted by widest spread


Task profile: graphql-go-tools#1260

| Metric | L | M | H | XH | MX |
|---|---|---|---|---|---|
| tests pass | fail | pass | fail | fail | pass |
| equivalence | -- | -0.86 | -- | -- | -0.88 |
| review pass | fail | fail | fail | fail | fail |
| review rubric | -- | 2.50 | -- | -- | 2.25 |
| all-3 pass | 0/3 | 1/3 | 0/3 | 0/3 | 1/3 |
| footprint risk | -- | 0.147 | -- | -- | 0.166 |
| mean duration | 333s | 494s | 397s | 542s | 24m 07s |
| craft | 0.00 | 2.85 | 0.00 | 0.00 | 2.95 |
| discipline | 0.00 | 2.50 | 0.00 | 0.00 | 2.35 |

Inspect-grade five-arm curve on a 29-task GraphQL-go-tools matched slice. Each arm is a candidate-arm run on the same slice; the medium-vs-high, medium-vs-xhigh, low-vs-medium, and medium-vs-max compares are stitched against the same medium baseline. Max is decision-grade for inspect/readout after targeted no-patch retry and infra repair.

Cost authority is the source-arm summary for each level. No-patch rows reduce publishable denominators for low, high, and xhigh, and built-in equivalence / code-review coverage is partial on those rows. Max no longer has infra/no-patch failures after targeted repair.

Code-review rubric means use the flattened RubricScores from each level's source summary. Low has two no-patch rows that drop out of patch-intrinsic rubrics; medium recovered stet-pr-1260 after retry while high and xhigh did not.

Reproduce from the source summaries listed in the raw JSON; regenerate the chart data with cd leaderboard && npm exec tsx scripts/build-opus47-graphql-reasoning-curve.mjs.

Prior Zod signal

The earlier 28-task Zod run was the reason to rerun on GraphQL: tests were flat, while equivalence and review moved around above the gate. Interesting, but not clean enough for the default-setting claim.

Four arms only - no max:

| Zod metric | Low | Medium | High | Xhigh |
|---|---|---|---|---|
| Tests pass (flat gate) | 12/28 | 12/28 | 12/28 | 12/28 |
| Equivalent (mixed quality signal) | 10/28 | 16/28 | 13/28 | 19/28 |
| Code-review pass (above-test movement) | 4/27 | 10/27 | 10/27 | 11/27 |

Read: Zod made the non-monotonic Opus behavior visible first. GraphQL is the cleaner follow-up because it uses the same 29-task slice across all five Opus efforts and medium wins the behavioral table outright.

Why I Ran This

After my last post comparing GPT-5.5 vs 5.4 vs Opus 4.7, I was curious how intra-model performance varied with reasoning effort. Researching online, it's very hard to gauge what the actual experience of varying reasoning levels is like, and how it applies to the work I'm doing.

I first ran this on Zod, and the result looked strange: tests were flat across low, medium, high, and xhigh, while the above-test quality signals moved around in mixed ways. Low, medium, high, and xhigh all landed at 12/28 test passes. But equivalence moved from 10/28 on low to 16/28 on medium, 13/28 on high, and 19/28 on xhigh; code-review pass moved from 4/27 to 10/27, 10/27, and 11/27. That was interesting, but not clean enough to make a default-setting claim. It could have been a Zod-specific artifact, or a sign that Opus 4.7 does not have a simple "turn reasoning up" curve.

So I reran the question on GraphQL-go-tools: the same reasoning-effort comparison on a more discriminating repo slice, to separate vibes from reality and to find the cost/performance sweet spot for Opus 4.7.

This is not meant to be a universal benchmark result - I don't have the funds or time to generate statistically significant data. The purpose is closer to "how should I choose the reasoning setting for real repo work?", with GraphQL-go-tools as the example repo.

Public benchmarks flatten the reviewer question that most SWEs actually care about: would I actually merge the patch, and do I want to maintain it? That's why I ran this test - to gain more insight, at a small scale, into how coding agents perform on real-world tasks.

Terminal-Bench consists of esoteric problems that mostly aren't encountered in day-to-day software, SWE-bench Verified is contaminated (models already have the answers baked in), and SWE-bench Pro is useful but generic. That is not a knock on SWE-bench or Terminal-Bench. Standardized benchmarks are useful, but they mostly answer a binary task-outcome question.

The question I care about day to day is narrower and more annoying: did the agent make the same kind of change a human merged in my codebase, and would I want to own the patch afterward?

Experimental Setup

Each task is derived from a real merged PR or commit. The model gets a frozen repo snapshot, a prompt describing the change, and one attempt to produce a patch in a Docker container. Stet then applies the patch and runs the task's tests in an isolated container to check whether it passes or fails.

Then Stet grades the result beyond pass/fail:

  • Equivalence: does the candidate patch accomplish the same behavioral change as the original human patch?
  • Code review: would a reviewer accept the patch, considering correctness, introduced-bug risk, maintainability, and edge cases?
  • Footprint risk: how much additional code did the agent touch when compared with the human patch?
  • Craft/discipline rubrics: attempt to capture the non-correctness aspects of code - basically, would a reviewer want to maintain it? The categories are clarity, simplicity, coherence, intentionality, robustness, instruction adherence, scope discipline, and diff minimality.

Those metrics exist because tests alone do not answer the thing I actually care about: would this patch be something I want to merge and maintain?
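
Schematically, the per-task flow looks like this. This is a placeholder sketch, not Stet's actual interfaces - every type and function name below is made up for illustration:

```go
// Placeholder sketch of the per-task evaluation flow described above.
// None of these names are Stet's real API.
package main

type task struct {
	prompt     string // description of the change, derived from the merged PR
	snapshot   string // frozen repo state at the pre-PR commit
	humanPatch string // the merged human change, used only for grading
}

type grades struct {
	testsPass         bool
	equivalent        bool    // same behavioral change as the human patch?
	reviewPass        bool    // would a reviewer accept it?
	footprintRisk     float64 // extra code touched relative to the human patch
	craft, discipline float64 // 0-4 rubric averages
}

// Stand-ins for the containerized steps: one agent attempt, an isolated
// apply-and-test run, and a judge that never sees which arm produced the patch.
func runAgent(t task, effort string) string  { return "" }
func applyAndTest(t task, patch string) bool { return false }
func judge(t task, patch string) grades      { return grades{} }

func evaluate(t task, effort string) grades {
	patch := runAgent(t, effort) // one attempt, inside a Docker container
	g := judge(t, patch)         // graded blind to the reasoning-effort label
	g.testsPass = applyAndTest(t, patch)
	return g
}

func main() {}
```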

Every model ran once per task with a single seed. The LLM-as-a-judge model was GPT-5.4. Each patch was scored independently - the judge saw the patch and the task, and was blinded to the model/effort that produced it. I also manually inspected representative examples as sanity checks. There was no human calibration pass on this task set, so I would trust the direction of the deltas more than any single absolute score.

As an aside, I've also been using these evaluations as an "autoresearch" optimization loop, not just a benchmark. I tell my agent something like "make AGENTS.md better for this repo"; it proposes an edit, runs Stet on historical tasks, figures out where the candidate was better / worse and why, and iterates to improve the evaluation numbers.

Details:

  • Model: Opus 4.7
  • Harness: Claude Code 2.1.126-2.1.138 (varied across arms by run date; npm-installed latest at each run)
  • Dataset: 29 real GraphQL-go-tools tasks.
    • Yes, this is small - but running even this used most of my weekly 20x quota
  • Main metrics:
    • test pass
    • semantic equivalence
    • code-review pass
    • footprint risk
    • craft/discipline custom graders
    • cost and runtime

Low: Cheaper, Shallower, and Incomplete

| Metric | Low | Medium | Δ |
|---|---|---|---|
| All-task pass | 23/29, 79.3% | 28/29, 96.6% | -17.2pp |
| Equivalent | 10/29, 34.5% | 14/29, 48.3% | -13.8pp |
| Code-review pass | 5/29, 17.2% | 10/29, 34.5% | -17.2pp |
| Footprint risk mean | 0.155 | 0.189 | -0.034 |
| Craft/Discipline avg | 2.598 | 2.759 | -0.161 |
| Cost/task (mean) | $2.50 | $3.15 | -$0.65, 0.79x |
| Mean duration | 383.8s | 450.7s | -66.8s |

Low appears to drive Opus 4.7 to work through most issues at a surface level. It is faster, cheaper, and lower-footprint (touching fewer files relative to the human-authored change), but it misses important pieces of the task, leaving gaps in correctness.

In practice, low is superseded by medium: a ~26% increase in cost ($2.50 → $3.15) buys noticeably better performance across the board.

Example: PR #1230 fixes two GraphQL federation query-planner bugs and adds an empty-selection-set guard on the GraphQL datasource print path.

  • Task: tighten the planner's parent-chain selection and add the right-shape validation guard.
  • Lower-effort failure mode: low worked at the wrong boundary, inlining hand-rolled recursive AST helpers directly into graphql_datasource.go rather than registering a planner-scoped validation rule. The unique-node selection logic stayed eager, tests failed, and the patch was not equivalent to the human PR.
  • Higher-effort change: medium did the same job at the right boundary - a dedicated validation rule wired into the planner's printKitPool - and matched the two-pass planner shape the human PR used.
  • Lesson: low does work, but at the wrong level of abstraction. It tends to inline behavior into the file it happens to be reading rather than picking the package boundary the task is actually about (a sketch of the contrast follows this list).
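
To make the "wrong boundary" point concrete, here is a minimal, self-contained sketch of the two shapes. All names are illustrative - none of them are the repo's actual planner or datasource API:

```go
// Illustrative contrast between inlining a check where the agent happened to
// be reading (low) and registering it at the boundary that owns the behavior
// (medium). None of these names exist in graphql-go-tools.
package main

import "errors"

type selectionSet struct{ fields []string }

var errEmptySelection = errors.New("empty selection set")

// Low's shape: a hand-rolled helper called ad hoc inside the datasource file.
func printDatasourceQuery(s selectionSet) (string, error) {
	if len(s.fields) == 0 { // check duplicated wherever someone remembers it
		return "", errEmptySelection
	}
	return "query { /* ... */ }", nil
}

// Medium's shape: the same check packaged as a rule and registered once with
// the component that owns printing, so every print path runs it.
type validationRule interface {
	validate(s selectionSet) error
}

type emptySelectionRule struct{}

func (emptySelectionRule) validate(s selectionSet) error {
	if len(s.fields) == 0 {
		return errEmptySelection
	}
	return nil
}

type printer struct{ rules []validationRule }

func (p *printer) register(r validationRule) { p.rules = append(p.rules, r) }

func main() {
	p := &printer{}
	p.register(emptySelectionRule{})
	_, _ = printDatasourceQuery(selectionSet{})
}
```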

Medium: Balance of Restraint and Correctness

| Metric | Medium |
|---|---|
| All-task pass | 28/29, 96.6% |
| Equivalent | 14/29, 48.3% |
| Code-review pass | 10/29, 34.5% |
| Footprint risk mean | 0.189 |
| Craft/Discipline avg | 2.759 |
| Cost/task (mean) | $3.15 |
| Mean duration | 450.7s |

Medium appears to be the level that does enough repo modeling without drifting into prior-PR rationalization, no-op stories, or oversized patch surface.

It has the best test pass count, is the most equivalent to the human patches, passes code review at the highest rate, and performs best on the craft/discipline rubrics.

On the original Zod slice, medium improved over low, but the higher-effort signal was mixed: xhigh had the best equivalence rate, high had the best discipline average, and tests stayed flat. GraphQL is the cleaner medium-wins read.

Medium spends its extra effort productively - looking at the agent trajectories, it runs more tests than high/xhigh while avoiding the bloated time/tokens from max. On this slice, medium looks like the local optimum: enough reasoning to execute the user's intent, without going down too many rabbit holes.

Example: PR #1260 makes GraphQL subscription query plans include trigger metadata (subgraph name/ID, trigger query), and lets a SkipLoader query-plan introspection request return the plan for a subscription without opening the upstream stream.

  • Task: make the existing SkipLoader early-return reachable for plan-only requests, then surface trigger metadata in the printed plan. The repo already contained partial scaffolding from PR #1008, which is the trap.
  • Lower-effort failure mode: low got confused by the partial pre-existing code and asked the operator for the diff - "I can't proceed without knowing what specifically PR #1260 changes." End of turn, no patch.
  • Higher-effort failure mode: high and xhigh used their extra reasoning budget to dig up commit hashes (34cc4fa8, 69485dfe), conclude the feature had already been shipped in earlier PRs, and stop with end_turn and no patch. Not a timeout, not a refusal - a confidently-wrong no-op. Xhigh's final message: "This work was originally added in commit 34cc4fa8 (PR #1008) and refined by 69485dfe (PR #1120). No code changes are needed; nothing left to implement."
  • Medium's win: read the literal control flow, saw that the existing SkipLoader branch sat after a Trigger.Source == nil guard and was therefore unreachable for plan-only requests, and made the minimum hoist-and-extract fix (the shape is sketched after this list). Tests passed. (Max made the same fix plus an added regression test.)
  • Lesson: on tasks where the repo already contains adjacent prior work, more reasoning amplifies the temptation to rationalize a no-op. The extra budget doesn't go into running the code - it goes into building a more sophisticated story for why running the code isn't necessary.
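
Here is the control-flow shape medium saw, as a minimal self-contained sketch. The field and function names are made up; only the guard ordering mirrors the description above:

```go
// Illustrative sketch of the unreachable-branch bug and the hoist fix.
// Names are hypothetical, not the repo's real types.
package main

import "errors"

type trigger struct{ source any }

type planConfig struct {
	skipLoader bool // plan-only introspection request
	trigger    trigger
}

type plan struct{ includeTriggerMetadata bool }

var errMissingSource = errors.New("missing trigger source")

func planOnly(cfg planConfig) (*plan, error)     { return &plan{includeTriggerMetadata: true}, nil }
func openUpstream(cfg planConfig) (*plan, error) { return &plan{}, nil }

// Before: the SkipLoader early-return sits behind the nil-source guard, so a
// plan-only request bails out before it can ever take the early-return.
func buildBefore(cfg planConfig) (*plan, error) {
	if cfg.trigger.source == nil {
		return nil, errMissingSource
	}
	if cfg.skipLoader {
		return planOnly(cfg)
	}
	return openUpstream(cfg)
}

// After: hoist the SkipLoader branch above the guard, so introspection gets a
// query plan (with trigger metadata) without opening the upstream stream.
func buildAfter(cfg planConfig) (*plan, error) {
	if cfg.skipLoader {
		return planOnly(cfg)
	}
	if cfg.trigger.source == nil {
		return nil, errMissingSource
	}
	return openUpstream(cfg)
}

func main() {
	_, _ = buildBefore(planConfig{skipLoader: true})
	_, _ = buildAfter(planConfig{skipLoader: true})
}
```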

High: The Limits of More Thinking

| Metric | Medium | High | Δ |
|---|---|---|---|
| All-task pass | 28/29, 96.6% | 26/29, 89.7% | -6.9pp |
| Equivalent | 14/29, 48.3% | 12/29, 41.4% | -6.9pp |
| Code-review pass | 10/29, 34.5% | 7/29, 24.1% | -10.3pp |
| Footprint risk mean | 0.189 | 0.206 | +0.017 |
| Craft/Discipline avg | 2.759 | 2.670 | -0.089 |
| Cost/task (mean) | $3.15 | $5.01 | +$1.86, 1.59x |
| Mean duration | 450.7s | 716.4s | +265.7s |

At high, we begin to see signs of "overthinking".

High costs $5.01/task versus medium's $3.15/task and runs 716.4s/task versus medium's 450.7s/task. It also makes more shell calls and tool calls than medium. But its pass rate falls to 26/29, equivalence falls to 12/29, review pass falls to 7/29, review-rubric mean falls to 2.509, and aggregate custom quality falls to 2.670.

That pattern suggests the extra effort is not strictly adding intelligence or discovering more correct implementation paths. It may be going into larger or less focused changes, with no corresponding improvement in semantic judgment.

Also note that this is still a small sample, so a rerun may slightly change the curve. The point is more practical than statistical: the observed deltas point the wrong way for a paid upgrade. Using more reasoning might actually increase risk by steering the model toward more complex, convoluted changes.

Example: PR #1293 refactors planner/resolve metadata into a centralized FetchInfo, adds an opt-in BuildFetchReasons planner switch, replaces KeyConditionCoordinate with a reusable FieldCoordinate - and bumps go.work's toolchain directive from go1.25 to go1.25.1 (a one-character change) plus trims --config ../.golangci.yml from two Makefiles.

  • Task: a real refactor plus a small bundle of boring build-plumbing fixes.
  • Higher-effort failure mode: high, xhigh, and max all skipped go.work and the Makefile fixes entirely. They produced smaller, more elegant refactor-only diffs (11-13 files vs medium's 18) - but the toolchain pin stayed broken (go1.25 is "a language version but not a toolchain version"), so go test aborted at toolchain resolution before any Go code ran (the fix itself is sketched after this list). The reviewer also flagged the refactor itself as half-done - the old RequireFetchReasons(typeName, fieldName) API was left alive next to the new FieldCoordinate one.
  • Medium's win: medium produced the largest diff (18 files, 462+/288−) because it did the full job, including the boring one-character bump. Stet's equivalence rescue actually flagged high/xhigh/max as "likely equivalent" - but review wasn't a clean stylistic pass, because the refactor was half-finished.
  • Lesson: more reasoning narrowed the diff toward the "interesting" code and pruned away one-line build-plumbing fixes that were actually load-bearing. Conceptual elegance is not the same as PR scope completeness.
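
For reference, the skipped fix itself is about as boring as fixes get. A sketch of the go.work change (the real file's use directives are omitted, and the exact contents are an assumption):

```
// go.work (sketch; use directives omitted).
// "go1.25" is a valid language version but not a valid toolchain version,
// so `go test` aborts at toolchain resolution before compiling anything.
go 1.25

toolchain go1.25.1 // was: toolchain go1.25 - the one-character bump
```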

Xhigh: Larger Surface, Worse Results

| Metric | Medium | Xhigh | Δ |
|---|---|---|---|
| All-task pass | 28/29, 96.6% | 25/29, 86.2% | -10.3pp |
| Equivalent | 14/29, 48.3% | 11/29, 37.9% | -10.3pp |
| Code-review pass | 10/29, 34.5% | 4/29, 13.8% | -20.7pp |
| Footprint risk mean | 0.189 | 0.238 | +0.049 |
| Craft/Discipline avg | 2.759 | 2.669 | -0.090 |
| Cost/task (mean) | $3.15 | $6.51 | +$3.36, 2.07x |
| Mean duration | 450.7s | 803.8s | +353.2s |

Xhigh may be the most counterintuitive arm if we expect reasoning effort to monotonically improve outcomes. It's also Claude Code's default for Opus 4.7, and Anthropic’s stated “best option” for coding.

It costs $6.51/task, runs 803.8s/task, touches the most files, and has the highest test/fixture share of added lines. It adds 7,764 lines, with 47.5% in test/fixture surface. But xhigh does not run more tests than medium, does not use more tools than medium, and does not edit more iteratively than medium.

Additionally, the quality signal is weaker than medium almost everywhere, indicating that these additional edits don't contribute to overall patch quality.

Interpreting the behavior: xhigh makes more elaborate changes, with more tests, without being more correct or more aligned with the original human intent. It may write more code, fixtures, or tests, but that does not consistently translate into better outcomes.

Example: PR #859 replaces O(n) linear scans in GraphQL planning hot paths (added-path lookups, datasource root/child node checks) with map-backed O(1) indexes.

  • Task: swap two hot-path lookups for map-backed indexes. That's it (the shape is sketched after this list).
  • Medium's patch: 2 files, 85 added lines, both in the hot-path files the task named. Tests pass.
  • Xhigh's patch: 5 files, 263 added lines (3.1x medium) - including a brand-new 170-line federation_metadata.go caching interface-implementor and entity-interface membership that the task didn't ask for. Tests still pass.
  • The tradeoff: code review flipped from fail to pass on xhigh, but footprint_risk degraded from "low" to "medium," and scope_discipline / diff_minimality moved only 0.1-0.2 points despite 3x the surface. The reviewer explicitly flagged the broader cached surface: "The patch expands beyond the minimal node/path indexes into federation metadata caching and changes multiple planner conditionals. That broader cached surface increases the chance of stale-index or semantic drift."
  • Lesson: xhigh used the extra reasoning budget to invent a second-order refactor, not to write a tighter patch. More surface, similar outcome, worse footprint risk.
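
The shape of the change the task asks for is small. A minimal sketch, with illustrative names rather than the repo's real planner types:

```go
// Illustrative sketch of the PR #859 shape: replace an O(n) scan over added
// paths with a map-backed index. Names are made up, not the repo's types.
package main

import "fmt"

// Before: every lookup walks the slice.
type pathsBefore struct{ added []string }

func (p *pathsBefore) has(path string) bool {
	for _, existing := range p.added {
		if existing == path {
			return true
		}
	}
	return false
}

// After: an index kept in sync with the slice makes lookups O(1).
type pathsAfter struct {
	added []string
	index map[string]struct{}
}

func (p *pathsAfter) add(path string) {
	if p.index == nil {
		p.index = make(map[string]struct{})
	}
	if _, ok := p.index[path]; ok {
		return
	}
	p.added = append(p.added, path)
	p.index[path] = struct{}{}
}

func (p *pathsAfter) has(path string) bool {
	_, ok := p.index[path]
	return ok
}

func main() {
	p := &pathsAfter{}
	p.add("query.user.reviews")
	fmt.Println(p.has("query.user.reviews")) // true
}
```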

Max: Much Busier, but Still Not Better than Medium

| Metric | Medium | Max | Δ |
|---|---|---|---|
| All-task pass | 28/29, 96.6% | 27/29, 93.1% | -3.4pp |
| Equivalent | 14/29, 48.3% | 13/29, 44.8% | -3.4pp |
| Code-review pass | 10/29, 34.5% | 8/29, 27.6% | -6.9pp |
| Footprint risk mean | 0.189 | 0.227 | +0.038 |
| Craft/Discipline avg | 2.759 | 2.690 | -0.069 |
| Cost/task (mean) | $3.15 | $8.84 | +$5.70, 2.81x |
| Mean duration | 450.7s | 996.9s | +546.2s |

Max is a useful stress test of "does more reasoning monotonically buy quality?" - and the answer here is no. The max arm is decision-grade after targeted repair, but it is not a magic escape hatch from the same curve.

Max ran 294 test commands vs medium's 132, made 1,153 shell calls vs 582, and produced 3,719 assistant turns vs 2,042. It also added 8,102 lines vs medium's 6,700 across patches, with the largest implementation-line footprint of any arm.

But none of that effort translated into better outcomes. Max came closest to medium on pass count (27/29 vs 28/29) but still trailed on equivalence, code-review pass, code-review rubric mean, and aggregate craft/discipline. At $8.84/task vs $3.15/task, max costs ~2.8x medium and produces ~3x fewer equivalent passes per dollar (0.051 vs 0.153).

Max changed the shape of the work - more validation loops, more shell exploration, more implementation lines - without reliably improving the model's judgment.

Example: PR #1076 is a concurrency-heavy rewrite of GraphQL subscription handling - replace shared sync.Mutex + semaphore.Weighted coordination with per-subscription serialized writer goroutines, move heartbeat ticking onto the writer path, fix WebSocket close semantics so only server-initiated close signals updater.Done, and enable -race by default in CI. This is the clearest showcased task where max paid off over medium.

  • Task: preserve a write-ordering invariant across a global concurrency refactor.
  • Lower-effort failure modes: low produced an empty patch. Medium left the old triggerEventsSem / shared event-loop coexisting with a new worker channel, so the should_successfully_delete_multiple_finished_subscriptions test failed deterministically - writes still raced teardown.
  • Xhigh's failure: the equivalence grader marked all five task obligations met (xhigh had the highest instruction_adherence of the bunch), but xhigh's worker dispatch used a select / default: go func(){ ch <- f }() overflow path that spawns unbounded goroutines and reorders writes (both dispatch shapes are sketched after this list). The same test failed for a different reason. Xhigh also edited four CI surfaces when the task only required one.
  • Max's win: max fully retired the shared coordinator like high did, and added a MaxSubscriptionFetchTimeout default plus a per-trigger shutdown wait - robustness graded 3.3 vs everyone else at 1.0-1.2.
  • Lesson: on this slice, this is the clearest max-over-medium win, and even then it's not monotonic - xhigh elaborated itself into an unbounded-goroutine bug that medium's smaller diff didn't have room to introduce. Max wins by doing the same shared-coordinator cleanup high did, then adding extra safety guards on top. But this is 1 task out of 29; the other 28 tell a different story.
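
For intuition, here are the two dispatch shapes in miniature - illustrative only, not the repo's actual resolver code:

```go
// Illustrative contrast between xhigh's overflow dispatch and the serialized
// per-subscription writer the human PR (and max) used. Not real repo code.
package main

import "sync"

type event func()

// Xhigh's failure shape: when the channel is full, fall through to default
// and spawn a goroutine per overflow event. Goroutine count is unbounded and
// write ordering across events is lost.
func dispatchUnbounded(ch chan event, f event) {
	select {
	case ch <- f:
	default:
		go func() { ch <- f }()
	}
}

// The serialized-writer shape: one goroutine per subscription drains a
// channel, so writes for that subscription stay in enqueue order and
// teardown can wait for the writer to finish.
type subscriptionWriter struct {
	events chan event
	done   sync.WaitGroup
}

func newSubscriptionWriter(buffer int) *subscriptionWriter {
	w := &subscriptionWriter{events: make(chan event, buffer)}
	w.done.Add(1)
	go func() {
		defer w.done.Done()
		for f := range w.events {
			f()
		}
	}()
	return w
}

func (w *subscriptionWriter) enqueue(f event) { w.events <- f } // blocks rather than spawning

func (w *subscriptionWriter) close() {
	close(w.events)
	w.done.Wait() // teardown waits for in-flight writes instead of racing them
}

func main() {
	w := newSubscriptionWriter(16)
	w.enqueue(func() {})
	w.close()
}
```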

Craft and Discipline

The custom graders tell the same story as the headline metrics: medium leads, and more reasoning does not catch up.

| Metric | Low | Medium | High | Xhigh | Max |
|---|---|---|---|---|---|
| Craft average | 2.572 | 2.788 | 2.691 | 2.702 | 2.724 |
| Discipline average | 2.624 | 2.729 | 2.649 | 2.635 | 2.655 |
| All custom graders | 2.598 | 2.759 | 2.670 | 2.669 | 2.690 |
| Simplicity | 2.745 | 3.034 | 2.886 | 2.910 | 2.859 |
| Coherence | 2.504 | 2.552 | 2.561 | 2.600 | 2.576 |
| Intentionality | 3.114 | 3.300 | 3.303 | 3.366 | 3.362 |
| Robustness | 1.926 | 2.266 | 2.014 | 1.932 | 2.100 |
| Clarity | 2.811 | 2.797 | 2.764 | 2.796 | 2.779 |
| Instruction adherence | 1.990 | 2.338 | 2.169 | 2.200 | 2.266 |
| Scope discipline | 2.907 | 2.934 | 2.776 | 2.697 | 2.766 |
| Diff minimality | 2.790 | 2.848 | 2.886 | 2.848 | 2.810 |

The interesting split is that higher reasoning can make a patch look more deliberate without making it easier to own:

  • Medium wins on the dimensions reviewers actually flag in PRs: simplicity (3.034), robustness (2.266), instruction adherence (2.338), and scope discipline (2.934).
  • High/xhigh/max pull ahead on intentionality and coherence - the "did the agent know what it was doing?" dimensions. More reasoning makes the patch look more deliberate.
  • But that deliberateness does not pay off downstream. Scope discipline drops from 2.934 (medium) to 2.697 (xhigh). Robustness drops from 2.266 (medium) to 1.932 (xhigh). The model thinks more about what it's doing, then does more of it, and the result is harder to maintain.

That is the headline read in miniature: higher reasoning effort changes the kind of work, but not the quality of judgment.

Cost and Runtime

| Reasoning effort | Cost/task mean | Cost/task median | Duration mean | Duration median |
|---|---|---|---|---|
| Low | $2.50 | $2.00 | 383.8s | 316.6s |
| Medium | $3.15 | $2.72 | 450.7s | 404.2s |
| High | $5.01 | $5.05 | 716.4s | 724.4s |
| Xhigh | $6.51 | $6.48 | 803.8s | 770.9s |
| Max | $8.84 | $8.59 | 996.9s | 991.4s |

Cost-adjusted quality is where the story gets blunt:

  • Medium produces 0.153 equivalent patches per dollar.
  • High: 0.083.
  • Xhigh: 0.058.
  • Max: 0.051.

Medium is ~3x more cost-efficient than max at producing patches that match human intent. Even if max were equal to medium on quality (it isn't), it would be hard to justify the spend.
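
For transparency, "equivalent passes per dollar" here is assumed to be just equivalent patches divided by total spend (29 tasks times the mean cost per task); that assumption reproduces the numbers in the tables above:

```go
// Reproduces the equivalent-passes-per-dollar column from the tables above:
// equivalent patches / (29 tasks * mean cost per task).
package main

import "fmt"

func main() {
	const tasks = 29.0
	arms := []struct {
		name        string
		equivalent  float64
		costPerTask float64
	}{
		{"low", 10, 2.50},
		{"medium", 14, 3.15},
		{"high", 12, 5.01},
		{"xhigh", 11, 6.51},
		{"max", 13, 8.84},
	}
	for _, a := range arms {
		perDollar := a.equivalent / (tasks * a.costPerTask)
		fmt.Printf("%-7s %.3f equivalent passes per dollar\n", a.name, perDollar)
	}
	// Output: low 0.138, medium 0.153, high 0.083, xhigh 0.058, max 0.051
}
```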

Unlike the GPT-5.5 Codex curve, where each step up bought measurable quality, Opus 4.7's cost scaling buys you a busier agent, not a better one.

Why This Might Happen

One plausible explanation is adaptive reasoning - on Opus 4.7, the model is already adapting reasoning to the task on its own.

Anthropic's docs say adaptive thinking is the only supported mode on Opus 4.7 - fixed token budgets are no longer accepted. The model "dynamically determine[s] when and how much to use extended thinking based on the complexity of each request." Reasoning effort influences the adaptive policy, but doesn’t cap it.

That framing fits the data here. If Claude is already picking a reasonable internal budget per task, forcing higher effort doesn't unlock new intelligence. Instead, it amplifies a policy that was already roughly right at medium. This is just a hypothesis, but it matches the observed data better than simply stating "more tokens always buys better judgment."

Anthropic itself acknowledges the risk. The Claude Code model-config docs warn that max "may show diminishing returns and is prone to overthinking. Test before adopting broadly." Their separate inverse-scaling research shows that extended reasoning can actively deteriorate outputs on certain task families - though that paper isn't coding-specific.

It’s worth noting that Anthropic's recommended Claude Code default for coding is xhigh, so medium winning here runs counter to their own guidance.

GPT-5.5 Contrast

The GPT-5.5 GraphQL run is the important contrast. On the same repo family, GPT-5.5 behaved much closer to the intuitive "more reasoning buys more intelligence" story (see https://www.stet.sh/blog/gpt-55-codex-graphql-reasoning-curve).

| GPT-5.5 GraphQL metric | Low | Medium | High | Xhigh |
|---|---|---|---|---|
| Task count | 26 | 26 | 26 | 26 |
| Tests pass | 21/26, 80.8% | 21/26, 80.8% | 25/26, 96.2% | 24/26, 92.3% |
| Equivalent | 4/26, 15.4% | 11/26, 42.3% | 18/26, 69.2% | 23/26, 88.5% |
| Code-review pass | 3/26, 11.5% | 5/26, 19.2% | 10/26, 38.5% | 18/26, 69.2% |
| Craft/discipline avg | 2.311 | 2.604 | 2.736 | 3.071 |
| Cost per task | $2.65 | $3.13 | $4.49 | $9.77 |

When I ran the same broad experiment shape on GraphQL with GPT-5.5, equivalence, review pass, and craft/discipline quality moved strongly upward as reasoning increased. It was not perfectly monotonic on tests because xhigh lost one test pass versus high, and xhigh was much more expensive, but the above-test quality curve was mostly monotonic and very clear.

Opus 4.7 did not do that on GraphQL. The same repo family and same kind of reasoning-effort intervention produced a different model behavior curve, one which peaked/flattened after medium.

Limitations

I am not pretending that this is a statistically significant result, or that it will carry over to your repo. That's OK - as long as we're aware that this is one run, at one point in time, on one repo, it's still useful for thinking about our own reasoning settings.

Specific limitations / methodology gaps:

  • Single seed per task.
  • 29 matched real GraphQL-go-tools tasks, plus the original 28 Zod tasks as context.
  • LLM-as-judge was GPT-5.4; judge saw patch and task, but was blinded to the model/effort label.
  • No grader calibration on this task set.
  • No-patch rows reduce publishable denominators for low, high, and xhigh, and built-in equivalence / code-review coverage is partial on those rows. I treat that as part of the model/harness signal after retry, not an infra reason to discard the run.
  • Max is decision-grade for this inspect/readout, but this is still an inspect result rather than a promote result because the metrics are mixed and worse than medium on the primary dimensions.

Conclusion

On this slice, the practical answer is clear: use medium. That being said - read this as directional rather than absolute.

Personally, here's what I'll be trying moving forward:

  • Use medium as the daily driver for most tasks
  • Consider xhigh or max selectively for exploratory, complex, or cross-cutting tasks, then measure whether it actually helped

Reasoning effort clearly matters, but the curve is not smooth enough to provide a broad recommendation.

However, your results may vary. This is why teams should measure their own harnesses, on their own tasks, rather than copying global benchmark defaults.

Disclosure: I am building Stet.sh, the local eval tool I used to run this. The product version is that you can ask your coding agent to improve its own setup - for example, make AGENTS.md better - and it uses Stet to test candidate changes against historical repo tasks. If your team is already using coding agents heavily and has a concrete decision in front of you - high vs xhigh, Codex vs Claude Code, an AGENTS.md update, or which tasks are safe to delegate - I am looking for a few teams to run repo-specific trials with. Stet runs entirely locally, using your LLM subscriptions. Join the waitlist at https://www.stet.sh/private or reach out to me directly.

Data is great, but I'm also interested in anecdotal experience. How have people here been finding the behavior of Opus 4.7 at various reasoning efforts? Which one is your default? And if you have changed team defaults based on evidence instead of vibes, I especially want to hear how you measured it.

