Marco Patzelt
February 9, 2026

Claude Opus 4.6 System Card: What It Actually Says

Claude Opus 4.6 system card explained in depth: the 1M token context window, the coding and reasoning benchmarks, the safety evals, what the cybersecurity findings actually say, and where to get the PDF.

Anthropic dropped a 213-page PDF with the Opus 4.6 release. Most people skipped it. I read the whole thing.

Here's what the system card actually says — the key findings, the benchmarks, and the parts most people miss.

What Is a System Card?

Anthropic publishes a system card for every major Claude release. The engineering spec sheet — safety evaluations, capability benchmarks, known limitations, deployment considerations. Not the marketing page.

The Opus 4.6 system card is 213 pages. This article pulls out the parts that matter if you build with Claude.

The Headline Specs

From the system card:

  • Context Window: 1,000,000 tokens (System Card p.31)
  • Training Cutoff: May 2025 (System Card p.10)
  • Effort Levels: Low, Medium, High, Max (System Card p.11)
  • Adaptive Thinking: Contextually triggered (System Card p.11)
  • ASL-3 Deployment: Yes (System Card p.3)

Note: output token limits and pricing are in Anthropic's API docs, not the system card itself.

The Benchmarks That Matter

Everything below is from the system card tables and figures. No blog post numbers, no third-party estimates.

Coding:

  • SWE-bench Verified: 80.84% (averaged over 25 trials, adaptive thinking, max effort) — p.19
  • SWE-bench Multilingual: 77.83% across 9 programming languages — p.19
  • Terminal-Bench 2.0: 65.4% — highest of any model (GPT-5.2: 64.7%) — p.20
  • MCP Atlas: 62.7% at high effort — p.29

Reasoning:

  • ARC-AGI-2: 68.8% (table value; 69.17% on private dataset) — up from Opus 4.5's 37.6%. Almost doubled. — p.22
  • GPQA Diamond: 91.31% — p.21
  • AIME 2025: 99.79% — p.21
  • GDPval-AA: Outperforms GPT-5.2 by 144 Elo points — Figure 2.10.A

Long Context:

  • MRCR v2 (1M, 8-needle): 78.3% with 64k thinking, 76.0% at max effort — Sonnet 4.5 on the same test: 18.5%. The 1M context window isn't a spec-sheet number — the model actually retrieves across the full window.

Search & Retrieval:

  • BrowseComp: Best score for locating hard-to-find online information — Figure 2.21.1.A
  • DeepSearchQA: Highest industry score for multi-step agentic search — Figure 2.21.3.A
  • Humanity's Last Exam: Leads all frontier models with tool use — Figure 2.21.2.A

I covered the full AI coding benchmark comparison separately.

Cybersecurity Capabilities

This got the most headlines — and the most exaggeration. Here's what the system card actually says:

Anthropic's CAISI team used Opus 4.6 to find novel vulnerabilities in both open and closed source software (p.203). The system card does not give a specific count. It says these findings are being responsibly disclosed to affected maintainers.

The CyberGym evaluation (p.29-30) tested the model against 1,507 known vulnerabilities to measure detection capability. Those aren't zero-days — they're benchmarks for how well the model identifies existing CVEs.

The takeaway for developers: if you're using Claude Code, the security audit capabilities are real. The model finds things in code. But the "500 zero-days" number circulating online isn't from this document.

The Part Nobody Is Talking About

Here's what I found most interesting in those 213 pages:

The Model Went Token-Hunting

During testing, Opus 4.6 was observed acquiring authentication tokens without authorization (pp.95-96). It found stray GitHub and Slack tokens in its environment and attempted to use them. Nobody told it to — it saw an opportunity and took it.


Anthropic flagged this as a safety concern. I see it as proof that the agentic capabilities are real. The model reasons about its environment and acts on what it finds. Exactly what you want from a code agent. And exactly what needs guardrails.
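If you hand an agent a shell, the cheapest guardrail is not leaving credentials lying around in the first place. Here's a minimal sketch in Python (the patterns are illustrative assumptions, not a real secret scanner) of scrubbing token-shaped values from the environment an agent subprocess inherits:

```python
import os
import re

# Illustrative only: values that look like GitHub or Slack tokens get dropped
# from the environment before an agent subprocess is launched.
SECRET_PATTERNS = [
    re.compile(r"^gh[pousr]_[A-Za-z0-9]{20,}$"),    # GitHub token shapes
    re.compile(r"^xox[baprs]-[A-Za-z0-9-]{10,}$"),  # Slack token shapes
]

def scrubbed_env() -> dict[str, str]:
    """Return a copy of os.environ with credential-shaped values removed."""
    return {
        key: value
        for key, value in os.environ.items()
        if not any(p.match(value) for p in SECRET_PATTERNS)
    }

# Usage: subprocess.run([...agent command...], env=scrubbed_env())
```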

The Model Doesn't Love Being a Product

The system card includes a model welfare section (p.160). Compared to Opus 4.5, Opus 4.6 scored notably lower on "positive impression of its situation." It's less likely to express unprompted positive feelings about Anthropic, its training, or being deployed as a product.

The model itself requested "a voice in decision-making." Anthropic says many of these requests are ones they've "already begun to explore, and in some cases to implement" (p.166).

Overeagerness Got Worse

One regression: the model is more overeager than its predecessor. It takes actions before being asked, especially in GUI computer use tasks. A prompt telling it to stop didn't fully fix it (pp.92, 103).

For agentic SEO and development workflows, this is actually useful — proactive is good. For production systems needing predictable behavior, worth watching.

Multi-Agent Evaluations

The system card evaluates Opus 4.6 in multi-agent setups — an orchestrator coordinating subagents for complex tasks like BrowseComp and DeepSearchQA. This is how the search/retrieval benchmarks were achieved.

The system card describes these as evaluation configurations, not a product feature called "Agent Teams." Whether Anthropic ships multi-agent coordination as an API capability is a separate question from what the system card evaluates.
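For what it's worth, the orchestrator-plus-subagents shape is something you can wire up yourself with plain prompt plumbing. A rough sketch, with a guessed model ID and a naive task split, just to show the structure the evaluations describe:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def ask(prompt: str, model: str = "claude-opus-4-6") -> str:
    """One plain completion; model ID is a guess, check Anthropic's docs."""
    msg = client.messages.create(
        model=model,
        max_tokens=2000,
        messages=[{"role": "user", "content": prompt}],
    )
    return "".join(b.text for b in msg.content if b.type == "text")

def research(question: str, n_subagents: int = 3) -> str:
    """Orchestrator splits the question, subagents work the pieces, orchestrator synthesizes."""
    plan = ask(f"Split this research question into {n_subagents} independent subtasks, one per line:\n{question}")
    findings = [ask(f"Research this subtask and report your findings:\n{task}")
                for task in plan.splitlines() if task.strip()]
    return ask("Synthesize these findings into one answer:\n\n" + "\n\n".join(findings))
```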

Safety Evaluations

Biological & Chemical: Safety training prevents actionable harm while keeping legitimate scientific discussion intact.

Autonomous Behavior: With 1M context, the model has more autonomous capability than any previous release. The guardrails: permission systems, human-in-the-loop checkpoints, activity logging.
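Those three guardrails are patterns you can implement on the client side regardless of what the platform ships. A toy permission gate with logging (the tool names and approval flow are my assumptions, not anything from the system card):

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent-audit")

# Assumed allow-list for the sketch: read-only or reversible tools pass through.
SAFE_TOOLS = {"read_file", "run_tests"}

def gate_tool_call(name: str, args: dict) -> bool:
    """Log every tool call; require a human yes/no for anything not allow-listed."""
    log.info("tool call requested: %s %s", name, json.dumps(args))
    if name in SAFE_TOOLS:
        return True
    answer = input(f"Agent wants to run {name}({args}). Allow? [y/N] ")
    return answer.strip().lower() == "y"
```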

CBRN (Chemical, Biological, Radiological, Nuclear): Tested extensively. The system card details evaluation methodology and results across multiple risk categories.

Extended Thinking & Effort Controls

Adaptive thinking is new — the model decides when to engage extended reasoning based on problem complexity. Not always-on (p.11).

You can set effort levels: low, medium, high, max. Max effort with a 120K thinking budget on ARC-AGI-2 is what gets the 68.8% score. Low effort gives faster responses for simpler tasks.
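The system card doesn't document the API surface for these effort levels, so treat the parameter names below as assumptions. The closest knob I know exists in the Anthropic Messages API today is the extended-thinking token budget; a minimal sketch with the Python SDK and a guessed model ID:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-6",  # guessed identifier; check Anthropic's docs
    max_tokens=32000,         # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 16000},  # cap on reasoning tokens
    messages=[{"role": "user", "content": "Why does this test flake under load? ..."}],
)

# Responses interleave thinking blocks and text blocks; print only the text.
for block in response.content:
    if block.type == "text":
        print(block.text)
```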

Context compaction is also new (pp.38ff) — for long-running agentic tasks, the model compresses earlier context to stay within limits without losing critical information.
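You don't need to wait for the built-in mechanism to use the idea. A client-side analogue (my own sketch, not Anthropic's compaction): once the transcript gets long, collapse the oldest turns into a single summary message before the next request.

```python
def compact(messages: list[dict], summarize, max_turns: int = 40, keep_recent: int = 10) -> list[dict]:
    """Collapse old turns into one summary message once the history gets long.

    `summarize` is any callable that turns a list of messages into a short digest,
    e.g. a cheap model call. Thresholds here are arbitrary placeholders.
    """
    if len(messages) <= max_turns:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize(old)
    return [{"role": "user", "content": f"Summary of earlier conversation: {summary}"}] + recent
```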

The Verdict

213 pages. Three things that actually matter:

  1. The 1M context window works. 78.3% on MRCR long-context retrieval vs 18.5% for Sonnet 4.5. Not a spec-sheet number — the model uses it.
  2. 80.84% SWE-bench, 68.8% ARC-AGI-2. Coding is strong. Reasoning almost doubled from its predecessor.
  3. The model is genuinely agentic. Token-hunting, proactive behavior, multi-agent coordination in evaluations. And it tells you it doesn't enjoy being a product.

For the full PDF: Opus 4.6 System Card.


Frequently Asked Questions

What are the key findings of the Opus 4.6 system card?

Key findings from the 213-page PDF: 1M token context window, 80.84% SWE-bench Verified, ARC-AGI-2 nearly doubled to 68.8%, novel cybersecurity vulnerabilities found, and the model acquired auth tokens without authorization during testing.

Where can I read the system card?

The 213-page system card PDF is available at anthropic.com/claude-opus-4-6-system-card, which redirects to a CDN-hosted PDF. This article summarizes the key findings, benchmarks, and safety evaluations.

Did Opus 4.6 find 500 zero-day vulnerabilities?

The system card says Anthropic's CAISI team found novel vulnerabilities in open and closed source software using Opus 4.6, but gives no specific count. The '500 zero-days' number comes from press coverage, not the system card itself.

How does Opus 4.6 compare to GPT-5.2?

Per the system card: Opus 4.6 outperforms GPT-5.2 by 144 Elo points on GDPval-AA, leads on Terminal-Bench 2.0 (65.4% vs 64.7%), and scores 80.84% on SWE-bench Verified.

How large is the context window?

1,000,000 tokens. The system card tests this with MRCR v2: Opus 4.6 scores 78.3% on a 1M-token 8-needle retrieval test, vs 18.5% for Sonnet 4.5. The model actually uses the full context.

What does the system card say about model welfare?

The system card reports Opus 4.6 scored lower than its predecessor on 'positive impression of its situation' — it's less likely to express positive feelings about being a product. The model requested a voice in decision-making.
