Claude Opus 4.8 Review: It Beat Codex on My Real Codebase

Today, May 28, 2026, Anthropic shipped Claude Opus 4.8. I don't usually rebuild my workflow around a point release — but I tested it for a few days against code I actually ship, and the driver's seat in my setup just changed hands.

Some context first. My daily AI work doesn't happen on toy problems. It happens inside a real production e-commerce platform: PrestaShop 8 on the back end, a Next.js multistore front end, REST overrides, aggressive caching, and a multishop catalog where a wrong assumption about stock or shop context costs actual revenue. For the heavy lifting, my go-to has been Codex, running GPT-5.5 — fast, cheap on tokens, and living right in my terminal. That's a high bar for anything trying to take its place.

So this isn't a benchmark recap. It's what happened when I pointed Claude Opus 4.8 at the messy stuff.

What Claude Opus 4.8 actually ships

The pitch is narrow and, to Anthropic's credit, refreshingly un-hyped: sharper judgment, more honesty about its own progress, and the stamina to work independently for longer. Anthropic itself describes it as a modest but tangible step up from 4.7 — not the language of a company overselling a launch.

What's new at a glance

Better judgment and honesty — roughly four times less likely than 4.7 to let a code flaw pass unremarked, and more willing to flag uncertainty than to bluff.

Dynamic Workflows — a Claude Code research preview that fans out hundreds of parallel subagents for big jobs like migrations across hundreds of thousands of lines.

Effort Control — a slider to trade depth of thinking for speed and rate-limit headroom, which matters if you've ever burned your allowance on a task that didn't need maximum horsepower.

Same price as 4.7 — $5 per million input tokens and $25 per million output in standard mode, with a cheaper fast mode at $10/$50.

The honesty upgrade

The headline trait is honesty, and it's more than marketing. Anthropic's alignment write-up reports notably lower rates of deceptive behavior — close, the company claims, to its much larger "Mythos" preview model. In practice it shows up as a model that tells you what it isn't sure about instead of papering over it.

The benchmarks, read honestly

Claude Opus 4.8 tops most of the coding tables. On SWE-Bench Pro — agentic, multi-step coding — early reporting puts it near 69%, up from 4.7's ~64% and ahead of GPT-5.5. On at least one "senior engineer" evaluation it edges GPT-5.5 by a single point while leaving its own predecessor about 30 points behind. Its computer-use scores are strong too, with one team reporting 84% on Online-Mind2Web.

But here's what the launch-day threads tend to skip: Codex still owns the terminal. On Terminal-Bench 2.1, GPT-5.5 leads; Claude Opus 4.8 jumped a lot versus 4.7 but sits behind. And GPT-5.5 uses roughly 72% fewer output tokens for comparable work — a genuine cost and latency advantage, not a rounding error.

Read together, the story is clean: Claude Opus 4.8 is the stronger reasoner, GPT-5.5/Codex the leaner operator. Which one wins depends entirely on the shape of your work — and on a launch day full of victory laps, it's worth saying that out loud.

Claude Opus 4.8 versus Codex — a hands-on coding comparison

My actual test: the bug Codex couldn't close

Now the honest part — a real bug from this week.

On one of my storefronts, add-to-cart had an ugly cold-load glitch. The very first click on a freshly loaded page flashed a red error toast for a split second before the green success toast appeared. Nothing was actually failing, but on a dark, orange-accented theme that red flash read like a crash. It's the kind of small, embarrassing bug that's hard to pin down because it only shows up on a cold load and then hides.

Codex had already taken a swing at it and made the toast feedback faster. Faster, but the flash survived — because speed wasn't the problem. The problem was sequence: the button had no loading state, and the notification briefly rendered a non-success state before the request came back.

When I handed Claude Opus 4.8 the same problem, it didn't patch the symptom. It traced the state flow, named the timing issue, and then changed the approach instead of nudging the timing: an optimistic toast. Show the green success state instantly on click, then either confirm it — and start its auto-hide — when the API succeeds, or morph that same toast into an error if it fails. One toast, no competing timers, no flash.

What actually sold me wasn't the diff. It was the behavior around it. It refused to call the fix done on my word. It wired up a MutationObserver, clicked add-to-cart, and captured the real sequence of toast states from the first rendered frame — proving there was exactly one green toast and zero red flash. Then it deployed to staging and ran the same proof against the live staging URL before it would say "fixed."

That is the honesty trait Anthropic is selling, except as behavior rather than a bullet point. It told me what it wasn't sure of, and then it verified instead of asserting. After a year of AI tools cheerfully declaring victory on code that didn't actually work, that's the part that moved my defaults.

Why "honest" beats "smart" in production

It sounds like a soft, marketing-friendly virtue. In my experience it's the opposite — it's the difference between a tool you can leave running and one you have to babysit.

A model that's slightly smarter but overconfident will happily write a plausible fix, tell you it's done, and leave you to discover at 2 a.m. that it silently swallowed an error or made an unsupported assumption about shop context. A model that's a touch more careful — that pushes back on a weak plan, flags the part it's unsure about, and insists on proving the change — is one I can actually delegate to. Judgment compounds over a long session in a way that a single benchmark point never captures.

That's the shift I felt this week, and it lines up with what early enterprise testers report: fewer check-ins, longer unattended runs, fewer "are you sure?" moments.

Where Codex still earns its seat

I'm not retiring Codex, and you probably shouldn't either. In my testing, each tool has a clear lane:

Reach for Claude Opus 4.8 when you need planning, judgment, error recovery, and gnarly multi-file reasoning — long-horizon work where quality per step matters most.

Reach for Codex (GPT-5.5) for tight terminal loops, quick mechanical edits, and high-throughput tasks where token cost and raw speed dominate.

What everyone else is saying

The reactions skew positive, and — unusually — they cluster on judgment rather than speed. Engineers at Cursor (Michael Truell) and at Cognition, the team behind Devin (Scott Wu), pointed to cleaner tool use and the consistency unattended, long-running agents need. A Shopify staff engineer, Tom Pritchard, singled out its better judgment and habit of catching its own mistakes. Most of the press coverage leaned on honesty as the standout story.

The skeptics are worth keeping in the room, though. It's day one. Anthropic's own "modest but tangible" is doing real work in that sentence. A larger Mythos-class model is already teased for the coming weeks, which makes today's flagship a moving target. And launch benchmarks plus first impressions are not the same thing as months of production scars.

My verdict on Claude Opus 4.8

For the work I actually care about — long-horizon, judgment-heavy changes in a codebase where a wrong guess costs money — Claude Opus 4.8 is my new default. Not because a chart reads 69 to 62, but because it pushed back on a weak plan, found a race condition an earlier, faster pass had walked straight past, and proved its own fix instead of taking a victory lap.

Codex keeps its seat for speed and cost, and the driver/worker split is still how I get the most done in a day. But this week, the driver changed.

And in the spirit of the model itself, the honest caveat: this is a first impression, on one real codebase, days into release. Take it as a strong signal, not a universal verdict. Then go point Claude Opus 4.8 at your own ugliest repo — that's the only benchmark that has ever mattered.

FAQ

Is Claude Opus 4.8 better than GPT-5.5 (Codex)?+

It depends on the task. Opus 4.8 leads on agentic-coding and reasoning benchmarks like SWE-Bench Pro and edges GPT-5.5 on senior-engineering evaluations. But GPT-5.5 still wins terminal-coding benchmarks and uses roughly 72% fewer output tokens, which makes Codex cheaper and faster for many jobs. In practice, many developers use Claude as the planner and reviewer and Codex as the fast executor.

When was Claude Opus 4.8 released?+

Anthropic released Claude Opus 4.8 on May 28, 2026, about a month after Opus 4.7. It is available immediately through the Claude API, Claude Code, and claude.ai.

How much does Claude Opus 4.8 cost?+

Pricing is unchanged from Opus 4.7: $5 per million input tokens and $25 per million output tokens in standard mode. A faster mode is available at $10 per million input and $50 per million output tokens.

What is actually new in Claude Opus 4.8?+

The headline changes are sharper judgment and honesty. Anthropic says it is about four times less likely than 4.7 to let code flaws pass unremarked, plus new Claude Code features like Dynamic Workflows (hundreds of parallel subagents, in research preview) and an Effort Control slider that trades depth of reasoning for speed and rate limits.

How does Claude Opus 4.8 score on SWE-Bench Pro?+

Early reporting puts Claude Opus 4.8 near 69% on SWE-Bench Pro, up from Opus 4.7's roughly 64% and ahead of GPT-5.5. On agentic terminal coding (Terminal-Bench 2.1) it improved sharply over 4.7 but still trails GPT-5.5, so benchmark leadership depends on the specific task.

What are Dynamic Workflows in Claude Code?+

Dynamic Workflows are a research-preview feature that lets Claude Opus 4.8 spin up hundreds of parallel subagents inside Claude Code to tackle large-scale tasks, such as migrations spanning hundreds of thousands of lines of code, with less manual supervision.

What is the Effort Control slider in Claude Opus 4.8?+

Effort Control is a setting in claude.ai and Cowork that balances how deeply Claude Opus 4.8 thinks against speed and rate-limit usage. Higher effort means more thorough reasoning; lower effort returns faster answers and consumes your rate limit more slowly.

Is Claude Opus 4.8 really more honest than previous models?+

Anthropic reports that Opus 4.8 is roughly four times less likely than Opus 4.7 to let a code flaw pass unremarked, is more likely to flag uncertainty than to make unsupported claims, and showed substantially lower rates of deceptive behavior in alignment testing. In hands-on use this shows up as the model verifying its own work instead of simply declaring success.

How do I access Claude Opus 4.8?+

Claude Opus 4.8 is available through the Claude API using the model name claude-opus-4-8, inside Claude Code, and in claude.ai and Cowork. Pricing and rate limits depend on the mode and effort level you select.

Should I switch from Codex to Claude Opus 4.8?+

For long-horizon, judgment-heavy production work, Opus 4.8 is a strong new default. For terminal-heavy, cost-sensitive, or high-throughput tasks, keep Codex in the mix too. Running both and letting each play to its strengths is the pragmatic move.

What is Claude Mythos?+

Mythos is a larger, more capable class of Anthropic model teased for release in the coming weeks. At the time of Opus 4.8's launch it appeared only as a preview in some benchmark comparisons and could not yet be purchased, so it sits outside a practical Opus 4.8 versus GPT-5.5 comparison today.

✻

Back to home