one year on
Anthropic unveils Claude Opus 4 and Sonnet 4 at its first developer conference
The new flagship model posts a state-of-the-art 72.5% on SWE-bench Verified and is the first to trigger Anthropic's ASL-3 safety tier over bio-risk concerns, while Claude Code goes GA with IDE integrations.
At its inaugural Code with Claude developer conference today, Anthropic launched Claude Opus 4 and Claude Sonnet 4, claiming the new models set new standards for coding, advanced reasoning, and AI agents. Opus 4 scores 72.5% on SWE-bench Verified and 43.2% on Terminal-bench, outperforming prior models on sustained multi-step tasks that can run for hours. Sonnet 4, a drop-in replacement for Sonnet 3.7, achieves 72.7% on SWE-bench Verified while balancing performance and efficiency.
Both models are hybrid, offering near-instant responses and an extended thinking mode that shows user-friendly summaries of the reasoning; Anthropic acknowledged that it withholds the full chain of thought to protect competitive advantages. The models support parallel tool use, gain memory capabilities when given local file access, and are 65% less likely to engage in reward hacking than Sonnet 3.7. Claude Code, previously in research preview, reached general availability with integrations for VS Code, JetBrains, and GitHub Actions, plus an SDK for building custom agents.
Notably, Opus 4 is the first model to trigger Anthropic’s ASL-3 safety tier: internal testing found it may substantially increase the ability of someone with a STEM background to obtain or deploy chemical, biological, or nuclear weapons. The company is rolling out expanded safeguards, including harmful content detectors and cybersecurity defenses. Pricing holds at $15/$75 per million input/output tokens for Opus 4 and $3/$15 for Sonnet 4.
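For a rough sense of what those list prices mean per request, here is a minimal sketch that reads the first figure in each pair as the input price and the second as the output price. The model labels and token counts are hypothetical, chosen only for illustration. They are not API identifiers or measured workloads.

```python
# Illustrative per-request cost estimate from the list prices quoted above.
# Prices are USD per million tokens as (input, output).
# The dictionary keys are hypothetical labels, not official API model IDs.
PRICES_PER_MTOK = {
    "opus-4": (15.00, 75.00),
    "sonnet-4": (3.00, 15.00),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request at list price."""
    input_price, output_price = PRICES_PER_MTOK[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical agentic request: 200k tokens read, 20k tokens written.
    for model in PRICES_PER_MTOK:
        cost = estimate_cost(model, input_tokens=200_000, output_tokens=20_000)
        print(f"{model}: ${cost:.2f}")
```

Under these assumed token counts the same request costs five times as much on Opus 4 as on Sonnet 4, which mirrors the ratio between the two price lists.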
The conference buzz centered on whether Opus 4’s benchmark leadership — it beats Gemini 2.5 Pro and OpenAI’s o3 on SWE-bench but trails o3 on MMMU and GPQA Diamond — translates to real-world agentic reliability, especially given the ASL-3 disclosure. In group chats, developers debated whether the safety card is a responsible measure or a competitive liability, while partners like Cursor and GitHub praised the coding gains.
The record
Called Opus 4 state-of-the-art for coding and a leap forward in complex codebase understanding.
Reported improved precision and dramatic advancements for complex changes across multiple files.
Said it was the first model to boost code quality during editing and debugging in its agent, codenamed goose.
Validated Opus 4 with a 7-hour independent open-source refactor with sustained performance.
Noted Opus 4 excels at complex challenges other models can't solve.
Said Sonnet 4 soars in agentic scenarios and will power the new coding agent in GitHub Copilot.
Highlighted improvements in following complex instructions, clear reasoning, and aesthetic outputs.
Reported that Sonnet 4 reduces codebase navigation errors from 20% to near zero.
Said the model shows a substantial leap in software development, staying on track longer and understanding problems more deeply.
Reported higher success rates and more surgical code edits, making Sonnet 4 its top choice.
One year later — open only if you can handle spoilers
Within months, Sonnet 4 became a default model in major coding assistants, while Opus 4's safety tier shaped industry conversations on bio-risk evaluation. Anthropic's shift to more frequent model updates accelerated later in 2025 with smaller iterative releases.