Today, May 28, 2026, Anthropic shipped Claude Opus 4.8. I don't usually rebuild my workflow around a point release — but I tested it for a few days against code I actually ship, and the driver's seat in my setup just changed hands.
Some context first. My daily AI work doesn't happen on toy problems. It happens inside a real production e-commerce platform: PrestaShop 8 on the back end, a Next.js multistore front end, REST overrides, aggressive caching, and a multishop catalog where a wrong assumption about stock or shop context costs actual revenue. For the heavy lifting, my go-to has been Codex, running GPT-5.5 — fast, cheap on tokens, and living right in my terminal. That's a high bar for anything trying to take its place.
So this isn't a benchmark recap. It's what happened when I pointed Claude Opus 4.8 at the messy stuff.
What Claude Opus 4.8 actually ships
The pitch is narrow and, to Anthropic's credit, refreshingly un-hyped: sharper judgment, more honesty about its own progress, and the stamina to work independently for longer. Anthropic itself describes it as a modest but tangible step up from 4.7 — not the language of a company overselling a launch.
What's new at a glance
The honesty upgrade
The headline trait is honesty, and it's more than marketing. Anthropic's alignment write-up reports notably lower rates of deceptive behavior — close, the company claims, to its much larger "Mythos" preview model. In practice it shows up as a model that tells you what it isn't sure about instead of papering over it.
The benchmarks, read honestly
Claude Opus 4.8 tops most of the coding tables. On SWE-Bench Pro — agentic, multi-step coding — early reporting puts it near 69%, up from 4.7's ~64% and ahead of GPT-5.5. On at least one "senior engineer" evaluation it edges GPT-5.5 by a single point while leaving its own predecessor about 30 points behind. Its computer-use scores are strong too, with one team reporting 84% on Online-Mind2Web.
But here's what the launch-day threads tend to skip: Codex still owns the terminal. On Terminal-Bench 2.1, GPT-5.5 leads; Claude Opus 4.8 jumped a lot versus 4.7 but sits behind. And GPT-5.5 uses roughly 72% fewer output tokens for comparable work — a genuine cost and latency advantage, not a rounding error.
Read together, the story is clean: Claude Opus 4.8 is the stronger reasoner, GPT-5.5/Codex the leaner operator. Which one wins depends entirely on the shape of your work — and on a launch day full of victory laps, it's worth saying that out loud.

My actual test: the bug Codex couldn't close
Now the honest part — a real bug from this week.
On one of my storefronts, add-to-cart had an ugly cold-load glitch. The very first click on a freshly loaded page flashed a red error toast for a split second before the green success toast appeared. Nothing was actually failing, but on a dark, orange-accented theme that red flash read like a crash. It's the kind of small, embarrassing bug that's hard to pin down because it only shows up on a cold load and then hides.
Codex had already taken a swing at it and made the toast feedback faster. Faster, but the flash survived — because speed wasn't the problem. The problem was sequence: the button had no loading state, and the notification briefly rendered a non-success state before the request came back.
When I handed Claude Opus 4.8 the same problem, it didn't patch the symptom. It traced the state flow, named the timing issue, and then changed the approach instead of nudging the timing: an optimistic toast. Show the green success state instantly on click, then either confirm it — and start its auto-hide — when the API succeeds, or morph that same toast into an error if it fails. One toast, no competing timers, no flash.
What actually sold me wasn't the diff. It was the behavior around it. It refused to call the fix done on my word. It wired up a MutationObserver, clicked add-to-cart, and captured the real sequence of toast states from the first rendered frame — proving there was exactly one green toast and zero red flash. Then it deployed to staging and ran the same proof against the live staging URL before it would say "fixed."
That is the honesty trait Anthropic is selling, except as behavior rather than a bullet point. It told me what it wasn't sure of, and then it verified instead of asserting. After a year of AI tools cheerfully declaring victory on code that didn't actually work, that's the part that moved my defaults.
Why "honest" beats "smart" in production
It sounds like a soft, marketing-friendly virtue. In my experience it's the opposite — it's the difference between a tool you can leave running and one you have to babysit.
A model that's slightly smarter but overconfident will happily write a plausible fix, tell you it's done, and leave you to discover at 2 a.m. that it silently swallowed an error or made an unsupported assumption about shop context. A model that's a touch more careful — that pushes back on a weak plan, flags the part it's unsure about, and insists on proving the change — is one I can actually delegate to. Judgment compounds over a long session in a way that a single benchmark point never captures.
That's the shift I felt this week, and it lines up with what early enterprise testers report: fewer check-ins, longer unattended runs, fewer "are you sure?" moments.
Where Codex still earns its seat
I'm not retiring Codex, and you probably shouldn't either. In my testing, each tool has a clear lane:
The setup I've settled on is the one a lot of people are quietly converging on: Claude in the driver's seat, Codex as the worker for fast, cheap execution. I liked that division enough that I built a small open-source Claude Code skill to let the models review each other's code — AI peer review across Claude, Codex, and Gemini→ — and Claude Opus 4.8 makes Claude a noticeably sharper reviewer in that loop. It's the same instinct behind testing agents head-to-head→ instead of trusting a spec sheet: the only review that counts is the one on your own machine.
What everyone else is saying
The reactions skew positive, and — unusually — they cluster on judgment rather than speed. Engineers at Cursor (Michael Truell) and at Cognition, the team behind Devin (Scott Wu), pointed to cleaner tool use and the consistency unattended, long-running agents need. A Shopify staff engineer, Tom Pritchard, singled out its better judgment and habit of catching its own mistakes. Most of the press coverage leaned on honesty as the standout story.
The skeptics are worth keeping in the room, though. It's day one. Anthropic's own "modest but tangible" is doing real work in that sentence. A larger Mythos-class model is already teased for the coming weeks, which makes today's flagship a moving target. And launch benchmarks plus first impressions are not the same thing as months of production scars.
My verdict on Claude Opus 4.8
For the work I actually care about — long-horizon, judgment-heavy changes in a codebase where a wrong guess costs money — Claude Opus 4.8 is my new default. Not because a chart reads 69 to 62, but because it pushed back on a weak plan, found a race condition an earlier, faster pass had walked straight past, and proved its own fix instead of taking a victory lap.
Codex keeps its seat for speed and cost, and the driver/worker split is still how I get the most done in a day. But this week, the driver changed.
And in the spirit of the model itself, the honest caveat: this is a first impression, on one real codebase, days into release. Take it as a strong signal, not a universal verdict. Then go point Claude Opus 4.8 at your own ugliest repo — that's the only benchmark that has ever mattered.


