Claude Opus 4.6: Benchmarks, Agent Teams & What Actually Changed
Anthropic's smartest model just got a major upgrade, and this time it's not just about benchmarks. Claude Opus 4.6 has dropped, bringing the first 1M-token context window for an Opus model, multi-agent coordination in Claude Code, and cybersecurity capabilities that surfaced more than 500 real zero-day vulnerabilities before launch.
Here is what really counts.
Opus 4.6 vs Opus 4.5: What Changed
Opus 4.5 arrived in November 2025. Three months later, 4.6 builds on it with targeted upgrades in five areas: Context, Agents, Coding, Enterprise Workflows, and Developer Controls.
**Context Window: 200K → 1M Tokens (Beta)**
The first Opus model with a million-token context window. On the MRCR v2 benchmark, a needle-in-a-haystack test, Opus 4.6 scores 76%; Sonnet 4.5 sits at 18.5%. This is not an incremental update; it is a qualitative shift in how much context the model can actually use.
**Output Tokens: 128K Max**
Enough to write entire codebases or complete documents in a single pass. No more chunking complex tasks into multiple requests.
Agent Teams in Claude Code
The headline feature. Instead of one agent working through tasks sequentially, you spin up multiple agents that each own their part of the problem and coordinate in parallel. Think of it as delegating to a team instead of micromanaging a single person. Currently available as a research preview in Claude Code and via the API. Activation: set `CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1` in your environment.
**Adaptive Thinking**
The model now decides from context how much reasoning to apply, so you no longer have to choose between "Thinking On" and "Thinking Off". Four effort levels: Low, Medium, High (default), Max.
**Context Compaction**
Claude can summarize its own older context during long-running tasks. No more hitting walls in the middle of complex multi-step jobs.
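As a rough mental model only (Anthropic has not published the mechanism, and the real feature summarizes with the model itself and budgets tokens, not characters), compaction behaves like folding the oldest turns into a stub once the transcript outgrows its budget:

```python
def compact(messages, max_chars):
    """Fold the oldest messages into a stub until the transcript fits.

    Conceptual sketch only: the real feature summarizes content with
    the model rather than dropping it, and counts tokens, not chars.
    """
    total = sum(len(m) for m in messages)
    dropped = 0
    # Evict from the oldest end until we are back under budget.
    while total > max_chars and messages:
        total -= len(messages.pop(0))
        dropped += 1
    if dropped:
        # Stand-in for a model-written summary of the evicted turns.
        messages.insert(0, f"[compacted {dropped} earlier messages]")
    return messages

compact(["x" * 50, "y" * 50, "short reply"], max_chars=40)
# → ["[compacted 2 earlier messages]", "short reply"]
```

The point is only the shape of the behavior: recent turns survive verbatim, older ones collapse into a compact placeholder so the task can keep running.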
Claude Opus 4.6 Benchmarks: The Numbers
Here is how Opus 4.6 compares to the competition:
| Benchmark | Opus 4.6 | Opus 4.5 | GPT-5.2 |
|---|---|---|---|
| GDPval-AA (Elo) | 1,606 | 190 Elo lower | 144 Elo lower |
| Terminal-Bench 2.0 | 65.4% | — | Lower |
| Humanity's Last Exam | #1 | Lower | Lower |
| BrowseComp | #1 | Lower | Lower |
| MRCR v2 (Context) | 76% | — | — |
| Finance Agent | 60.7% | 55.2% | Lower |
| BigLaw Bench | 90.2% | Lower | Lower |
GDPval-AA is the most important metric here. It measures performance on economically valuable knowledge work: finance, legal, and other complex domains. Opus 4.6 beats GPT-5.2 by 144 Elo points, which translates to a win rate of roughly 70% in head-to-head comparisons.
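The Elo-to-win-rate conversion is easy to check with the standard expected-score formula, where a rating gap of d points implies a win probability of 1 / (1 + 10^(−d/400)):

```python
def elo_win_prob(diff: float) -> float:
    """Expected win probability for the higher-rated side,
    given an Elo rating gap of `diff` points."""
    return 1 / (1 + 10 ** (-diff / 400))

print(round(elo_win_prob(144), 3))  # the 144-point GDPval-AA gap → 0.696
```

A 144-point gap comes out at about 69.6%, which matches the ~70% figure.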
Terminal-Bench 2.0 at 65.4% is the highest score any model has achieved on this agentic coding evaluation. And on Humanity's Last Exam, a test of complex multidisciplinary reasoning, Opus 4.6 leads all other frontier models. Period.
BigLaw Bench: 90.2%. Anthropic's system card confirms the figure, with perfect scores on 40% of tasks. For legal AI this sets a new bar, and it is the reason Harvey (the AI legal tool) is already using the model in production.
Opus 4.6 vs GPT-5.2: The Direct Comparison
The question everyone is asking: How does Opus 4.6 compare to OpenAI's newest model?
| Factor | Opus 4.6 | GPT-5.2 |
|---|---|---|
| Enterprise Tasks (GDPval-AA) | 1,606 Elo | 1,462 Elo |
| Context Window | 1M Tokens (Beta) | 1M Tokens |
| Max Output | 128K Tokens | 32K Tokens |
| Agentic Coding | 65.4% (Terminal-Bench 2.0) | Lower |
| Pricing (Input) | $5/M Tokens | $10/M Tokens |
| Agent Teams | ✓ (Claude Code) | ✗ |
A 144-point Elo gap on enterprise tasks is not a rounding error; it is a win rate of roughly 70%. Opus 4.6 also leads on output capacity and pricing. GPT-5.2's remaining strength is its broader ecosystem and multimodal integration.
The 500 Zero-Days Story
Before launch, Anthropic's Frontier Red Team gave the model access to Python and vulnerability analysis tools in a sandboxed environment. No specialized instructions. No pre-loaded knowledge of specific gaps.
The model found over 500 previously unknown zero-day vulnerabilities in open-source code. Every single one was validated by Anthropic's team or external security researchers. The System Card PDF documents the entire process.
What it found:
- A bug in Ghostscript (the PDF/PostScript utility) that could crash systems.
- Buffer overflow bugs in OpenSC (smart card processing) and CGIF (GIF processing).
These are not "toy vulnerabilities"—this is critical infrastructure software.
Logan Graham, Head of Anthropic's Frontier Red Team: "I wouldn't be surprised if this becomes one of the—or the main way—open-source software is secured in the future."
Anthropic added six new cybersecurity probes to detect potentially harmful uses of these capabilities. They also plan real-time interventions to block abuse.
Claude Code Agent Teams: How It Works
Agent Teams are the most sought-after feature, and rightfully so. Here's how it works:
Activation:

```shell
export CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1
```
What it does: Instead of a single Claude agent processing everything sequentially, you spin up multiple subagents. Each owns part of the problem and works in parallel; the lead agent coordinates and merges the results.
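Claude Code's internals aren't public, but the fan-out/merge pattern just described can be sketched with ordinary Python concurrency. The `run_subagent` function here is a placeholder, not a real model call:

```python
from concurrent.futures import ThreadPoolExecutor

def run_subagent(task: str) -> str:
    # Placeholder: in Claude Code, each subagent is a real model
    # session independently working its slice of the problem.
    return f"[{task}] done"

tasks = ["audit auth module", "scan dependencies", "review API surface"]

# Lead agent fans tasks out to subagents running in parallel,
# then merges their results into a single report.
with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
    results = list(pool.map(run_subagent, tasks))

report = "\n".join(results)
print(report)
```

The win is the same as with any fan-out/merge design: independent subtasks run concurrently instead of queuing behind one worker, and only the merge step is sequential.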
Real-World Example (NBIM): Norway's sovereign wealth fund tested Agent Teams in 40 cybersecurity investigations. Each test involved up to 9 subagents and 100+ tool calls. Opus 4.6 beat the Claude 4.5 models 38 out of 40 times in blind ranking.
Rakuten: Opus 4.6 autonomously closed 13 GitHub issues and assigned 12 issues to the correct team members—in a single day.
Bolt.new: "It one-shotted a fully functional physics engine."
Enterprise & Productivity Updates
**Claude in PowerPoint (Research Preview)**
Claude now lives in a side panel in PowerPoint. It reads your existing layouts, fonts, and templates, then generates or edits slides while preserving your design system. Available in beta for Max, Team, and Enterprise plan customers.
**Claude in Excel (Upgraded)**
Better at long, multi-step tasks. Complex financial models that previously required babysitting now work more reliably on the first pass.
**Cowork Integration**
Opus 4.6 powers Cowork's autonomous multitasking: it can create documents, spreadsheets, and presentations simultaneously while you focus on reviewing the outputs.
Claude Opus 4.6 Pricing, API & Availability
| Detail | Value |
|---|---|
| API Model String | `claude-opus-4-6` |
| Input Pricing | $5 per Million Tokens |
| Output Pricing | $25 per Million Tokens |
| Context Window | 1M Tokens (Beta) |
| Max Output Tokens | 128K |
| Effort Parameter | Low, Medium, High (Default), Max |
| Availability | claude.ai, API, AWS Bedrock, Google Cloud Vertex AI, Microsoft Foundry, GitHub Copilot |
Pricing is unchanged from Opus 4.5, and the model is available on all major cloud platforms from day one. Anthropic recommends dialing the effort parameter down to Medium if the model "overthinks" simple tasks.
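As a sketch, a request against the new model might look like the payload below. The model string and the 128K output ceiling come from the table above; the `effort` field name is an assumption based on the four published levels and may differ in the actual API:

```python
# Hypothetical request payload; "effort" is an ASSUMED field name,
# not confirmed API surface.
request = {
    "model": "claude-opus-4-6",   # model string from the table above
    "max_tokens": 128_000,        # new 128K output ceiling
    "effort": "medium",           # dial down from the "high" default
                                  # if the model overthinks simple tasks
    "messages": [
        {"role": "user", "content": "Refactor this module and add tests."}
    ],
}
```

A payload like this would be passed to the Messages endpoint (for example via the official SDK's `messages.create`); check Anthropic's API reference for the final parameter names before relying on them.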
Context: Why This Matters Now
This launch didn't happen in a vacuum. OpenAI released its Codex desktop app three days ago, aiming directly at Claude Code's momentum. Software stocks have shed $285 billion in market value on investor fears of AI disruption. And Claude Code reached a $1 billion revenue run-rate just six months after general availability.
Opus 4.6 is Anthropic's answer to one question: Can AI move from chatbot to real knowledge worker? The Agent Teams, 1M context, and PowerPoint integration all point in the same direction—Claude does real work instead of just answering questions.
The Verdict
Opus 4.6 is the best model Anthropic has ever shipped. It is not a revolutionary generational leap; it is Opus 4.5 with polished edges and genuinely new capabilities bolted on.
The 1M Context Window and Agent Teams are the real story. Context allows handling enterprise-scale codebases. Agent Teams allow parallelizing work that used to be sequential. The 500 zero-day findings are both impressive and slightly scary.
If you are already on Claude Code or Cowork, this is an upgrade at no extra cost. If you were undecided before, this is the model that makes "AI as a team member" feel less like marketing and more like reality.