Claude Opus 4.6: Benchmarks, Agent Teams & What Actually Changed
Anthropic's smartest model just got a major upgrade, and this time it's not just about benchmarks. Claude Opus 4.6 has dropped, bringing the first 1M-token context window for an Opus model, multi-agent coordination in Claude Code, and cybersecurity capabilities that surfaced more than 500 real zero-day vulnerabilities before launch.
Here is what really counts.
Opus 4.6 vs Opus 4.5: What Changed
Opus 4.5 arrived in November 2025. Three months later, 4.6 builds on it with targeted upgrades in five areas: Context, Agents, Coding, Enterprise Workflows, and Developer Controls.
**Context Window: 200K → 1M Tokens (Beta)**
The first Opus model with a million-token context window. On the MRCR v2 benchmark, a needle-in-a-haystack test, Opus 4.6 scores 76%; Sonnet 4.5 sits at 18.5%. This is not an incremental update; it is a qualitative shift in how much context the model can actually use.
**Output Tokens: 128K Max**
Enough to write entire codebases or complete documents in a single pass. No more chunking complex tasks into multiple requests.
Agent Teams in Claude Code
The headline feature. Instead of one agent working through tasks sequentially, you spin up multiple agents that each own their part of the problem and coordinate in parallel. Think of it as delegating to a team instead of micromanaging a single person. Currently available as a research preview in Claude Code and via the API. Activation: set `CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1` in your environment.
**Adaptive Thinking**
The model now decides from context how much reasoning to apply, so you no longer have to choose between "Thinking On" and "Thinking Off". Four effort levels: Low, Medium, High (default), Max.
**Context Compaction**
Claude can summarize its own older context during long-running tasks. No more hitting walls in the middle of complex multi-step jobs.
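As a rough mental model only (Anthropic has not published the mechanism, and the real feature summarizes with the model itself and budgets tokens, not characters), compaction behaves like folding the oldest turns into a stub once the transcript outgrows its budget:

```python
def compact(messages, max_chars):
    """Fold the oldest messages into a stub until the transcript fits.

    Conceptual sketch only: the real feature summarizes content with
    the model rather than dropping it, and counts tokens, not chars.
    """
    total = sum(len(m) for m in messages)
    dropped = 0
    # Evict from the oldest end until we are back under budget.
    while total > max_chars and messages:
        total -= len(messages.pop(0))
        dropped += 1
    if dropped:
        # Stand-in for a model-written summary of the evicted turns.
        messages.insert(0, f"[compacted {dropped} earlier messages]")
    return messages

compact(["x" * 50, "y" * 50, "short reply"], max_chars=40)
# → ["[compacted 2 earlier messages]", "short reply"]
```

The point is only the shape of the behavior: recent turns survive verbatim, older ones collapse into a compact placeholder so the task can keep running.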
Claude Opus 4.6 Benchmarks: The Numbers
Here is how Opus 4.6 compares to the competition:
| Benchmark | Opus 4.6 | Opus 4.5 | GPT-5.2 |
|---|---|---|---|
| GDPval-AA (Elo) | 1,606 | 190 Elo lower | 144 Elo lower |
| Terminal-Bench 2.0 | 65.4% | — | Lower |
| Humanity's Last Exam | #1 | Lower | Lower |
| BrowseComp | #1 | Lower | Lower |
| MRCR v2 (Context) | 76% | — | — |
| Finance Agent | 60.7% | 55.2% | Lower |
| BigLaw Bench | 90.2% | Lower | Lower |
GDPval-AA is the most important metric here. It measures performance on economically valuable knowledge work: finance, legal, and other complex domains. Opus 4.6 beats GPT-5.2 by 144 Elo points, which translates to a win rate of roughly 70% in head-to-head comparisons.
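The Elo-to-win-rate conversion is easy to check with the standard expected-score formula, where a rating gap of d points implies a win probability of 1 / (1 + 10^(−d/400)):

```python
def elo_win_prob(diff: float) -> float:
    """Expected win probability for the higher-rated side,
    given an Elo rating gap of `diff` points."""
    return 1 / (1 + 10 ** (-diff / 400))

print(round(elo_win_prob(144), 3))  # the 144-point GDPval-AA gap → 0.696
```

A 144-point gap comes out at about 69.6%, which matches the ~70% figure.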
Terminal-Bench 2.0 at 65.4% is the highest score any model has achieved on this agentic coding evaluation. And on Humanity's Last Exam, a test of complex multidisciplinary reasoning, Opus 4.6 leads all other frontier models. Period.
BigLaw Bench: 90.2%. Anthropic's system card confirms the figure, with perfect scores on 40% of tasks. For legal AI this sets a new bar, and it is the reason Harvey (the AI legal tool) is already using the model in production.
Opus 4.6 vs GPT-5.2: The Direct Comparison
The question everyone is asking: How does Opus 4.6 compare to OpenAI's newest model?
| Factor | Opus 4.6 | GPT-5.2 |
|---|---|---|
| Enterprise Tasks (GDPval-AA) | 1,606 Elo | 1,462 Elo |
| Context Window | 1M Tokens (Beta) | 1M Tokens |
| Max Output | 128K Tokens | 32K Tokens |
| Agentic Coding | 65.4% (Terminal-Bench 2.0) | Lower |
| Pricing (Input) | $5/M Tokens | $10/M Tokens |
| Agent Teams | ✓ (Claude Code) | ✗ |
A 144-point Elo gap on enterprise tasks is not a rounding error; it is a win rate of roughly 70%. Opus 4.6 also leads on output capacity and pricing. GPT-5.2's remaining strength is its broader ecosystem and multimodal integration.
The 500 Zero-Days Story
Before launch, Anthropic's Frontier Red Team gave the model access to Python and vulnerability analysis tools in a sandboxed environment. No specialized instructions. No pre-loaded knowledge of specific gaps.
The model found over 500 previously unknown zero-day vulnerabilities in open-source code. Every single one was validated by Anthropic's team or external security researchers. The System Card PDF documents the entire process.
What it found:
- A bug in Ghostscript (the PDF/PostScript utility) that could crash systems.
- Buffer overflow bugs in OpenSC (smart card processing) and CGIF (GIF processing).
These are not "toy vulnerabilities"—this is critical infrastructure software.
Logan Graham, Head of Anthropic's Frontier Red Team: "I wouldn't be surprised if this becomes one of the—or the main way—open-source software is secured in the future."
Anthropic added six new cybersecurity probes to detect potentially harmful uses of these capabilities. They also plan real-time interventions to block abuse.
Claude Code Agent Teams: How It Works
Agent Teams are the most sought-after feature, and rightfully so. Here's how it works:
Activation:

```shell
export CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1
```
What it does: Instead of a single Claude agent processing everything sequentially, you spin up multiple subagents. Each owns part of the problem and works in parallel; the lead agent coordinates and merges the results.
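Claude Code's internals aren't public, but the fan-out/merge pattern just described can be sketched with ordinary Python concurrency. The `run_subagent` function here is a placeholder, not a real model call:

```python
from concurrent.futures import ThreadPoolExecutor

def run_subagent(task: str) -> str:
    # Placeholder: in Claude Code, each subagent is a real model
    # session independently working its slice of the problem.
    return f"[{task}] done"

tasks = ["audit auth module", "scan dependencies", "review API surface"]

# Lead agent fans tasks out to subagents running in parallel,
# then merges their results into a single report.
with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
    results = list(pool.map(run_subagent, tasks))

report = "\n".join(results)
print(report)
```

The win is the same as with any fan-out/merge design: independent subtasks run concurrently instead of queuing behind one worker, and only the merge step is sequential.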
Real-World Example (NBIM): Norway's sovereign wealth fund tested Agent Teams in 40 cybersecurity investigations. Each test involved up to 9 subagents and 100+ tool calls. Opus 4.6 beat the Claude 4.5 models 38 out of 40 times in blind ranking.
Rakuten: Opus 4.6 autonomously closed 13 GitHub issues and assigned 12 issues to the correct team members—in a single day.
Bolt.new: "It one-shotted a fully functional physics engine."
Enterprise & Productivity Updates
**Claude in PowerPoint (Research Preview)**
Claude now lives in a side panel in PowerPoint. It reads your existing layouts, fonts, and templates, then generates or edits slides while preserving your design system. Available in beta for Max, Team, and Enterprise plan customers.
**Claude in Excel (Upgraded)**
Better at long, multi-step tasks. Complex financial models that previously required babysitting now work more reliably on the first pass.
**Cowork Integration**
Opus 4.6 powers Cowork's autonomous multitasking: it can create documents, spreadsheets, and presentations simultaneously while you focus on reviewing the outputs.
Claude Opus 4.6 Pricing, API & Availability
| Detail | Value |
|---|---|
| API Model String | `claude-opus-4-6` |
| Input Pricing | $5 per Million Tokens |
| Output Pricing | $25 per Million Tokens |
| Context Window | 1M Tokens (Beta) |
| Max Output Tokens | 128K |
| Effort Parameter | Low, Medium, High (Default), Max |
| Availability | claude.ai, API, AWS Bedrock, Google Cloud Vertex AI, Microsoft Foundry, GitHub Copilot |
Pricing is unchanged from Opus 4.5, and the model is available on all major cloud platforms from day one. Anthropic recommends dialing the effort parameter down to Medium if the model "overthinks" simple tasks.
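As a sketch, a request against the new model might look like the payload below. The model string and the 128K output ceiling come from the table above; the `effort` field name is an assumption based on the four published levels and may differ in the actual API:

```python
# Hypothetical request payload; "effort" is an ASSUMED field name,
# not confirmed API surface.
request = {
    "model": "claude-opus-4-6",   # model string from the table above
    "max_tokens": 128_000,        # new 128K output ceiling
    "effort": "medium",           # dial down from the "high" default
                                  # if the model overthinks simple tasks
    "messages": [
        {"role": "user", "content": "Refactor this module and add tests."}
    ],
}
```

A payload like this would be passed to the Messages endpoint (for example via the official SDK's `messages.create`); check Anthropic's API reference for the final parameter names before relying on them.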
Context: Why This Matters Now
This launch didn't happen in a vacuum. OpenAI released its Codex desktop app three days ago, aiming directly at Claude Code's momentum. Software stocks have shed $285 billion in market value on investor fears of AI disruption. And Claude Code reached a $1 billion revenue run-rate just six months after general availability.
Opus 4.6 is Anthropic's answer to one question: Can AI move from chatbot to real knowledge worker? The Agent Teams, 1M context, and PowerPoint integration all point in the same direction—Claude does real work instead of just answering questions.
The Verdict
Opus 4.6 is the best model Anthropic has ever shipped. It is not a revolutionary generational leap; it is Opus 4.5 with polished edges and genuinely new capabilities bolted on.
The 1M Context Window and Agent Teams are the real story. Context allows handling enterprise-scale codebases. Agent Teams allow parallelizing work that used to be sequential. The 500 zero-day findings are both impressive and slightly scary.
If you are already on Claude Code or Cowork, this is an upgrade at no extra cost. If you were undecided before, this is the model that makes "AI as a team member" feel less like marketing and more like reality.