Three major coding model drops in seven days. Claude Opus 4.6 yesterday. GPT-5.3 Codex twenty minutes later. Qwen3-Coder-Next three days ago. Everyone's claiming their model is "best for coding." Nobody tells you which leaderboard they're citing—or why the numbers don't match across five different SWE-bench rankings. I dug into all of them. Here's what the benchmarks actually say, what they don't, and which model is worth your money depending on what you actually build.
February 5, 2026: The 20-Minute War
Anthropic dropped Claude Opus 4.6 around 6:40 PM. Twenty minutes later, OpenAI fired back with GPT-5.3 Codex. Not a coincidence—a calculated power move. Here's the scorecard from day one:
| Benchmark | Claude Opus 4.6 | GPT-5.3 Codex | Winner |
|---|---|---|---|
| SWE-bench Verified | 80.8% | ~56.8%* | Opus 4.6 |
| Terminal-Bench 2.0 | 65.4% | 77.3% | GPT-5.3 |
| OSWorld (Computer Use) | 72.7% | — | Opus 4.6 |
| ARC-AGI-2 | 68.8% | — | Opus 4.6 |
*Different SWE-bench version for GPT-5.3.
Different benchmarks, different winners. Opus 4.6 dominates bug fixing. GPT-5.3 Codex crushes terminal-based agentic coding. As one Hacker News commenter put it: "The shortest lived lead in less than 35 minutes." Neither company's marketing will tell you that neither model is universally "best."
Claude Opus 4.6: Full Breakdown
Released February 5, 2026. Same pricing as Opus 4.5: $5/$25 per million input/output tokens. But the upgrades go far beyond raw SWE-bench.
Opus 4.6 Benchmarks
| Benchmark | Opus 4.6 | Opus 4.5 | Change |
|---|---|---|---|
| SWE-bench Verified | 80.8% | 80.9% | -0.1% |
| Terminal-Bench 2.0 | 65.4% | 59.3% | +6.1% |
| ARC-AGI-2 | 68.8% | 37.6% | +31.2% |
| OSWorld | 72.7% | 66.3% | +6.4% |
| BrowseComp | 84.0% | 67.8% | +16.2% |
| Humanity's Last Exam | 40.0% | 30.8% | +9.2% |
| BigLaw Bench | 90.2% | — | Leads |
SWE-bench Verified is basically flat—80.8% vs 80.9%. Anthropic didn't optimize for bug-fixing this time. The upgrades are everywhere else. ARC-AGI-2 nearly doubled: 68.8% from 37.6%. This measures novel problem-solving—tasks the model hasn't seen in training. The biggest single-benchmark jump in a frontier model update I've seen. GPT-5.2 scored 54.2% and Gemini 3 Pro 45.1%.
Terminal-Bench 2.0 jumped from 59.3% to 65.4%. Real terminal work—running tests, debugging, navigating complex dev environments. Opus 4.6 leads all frontier models here except GPT-5.3 Codex (77.3%, released the same day).
Opus 4.6: 1M Token Context Window
Opus 4.5 had 200K. Opus 4.6 jumps to 1M (beta). That's roughly 750,000 words. But context window size means nothing without retrieval quality.
MRCR v2 (Needle-in-a-haystack across 1M tokens):
- Opus 4.6: 76.0%
- Sonnet 4.5: 18.5%
That's not a benchmark gap. That's a different capability class. Sonnet 4.5 loses information in long contexts. Opus 4.6 actually uses it. Pricing note: Standard $5/$25 per MTok applies up to 200K tokens. Beyond 200K: $10/$37.50. The 1M window is powerful but expensive—plan accordingly.
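The pricing tiers matter as much as the window itself, so here's a minimal sketch of the request-cost math, assuming (as with Anthropic's earlier long-context pricing) that the premium rate applies to the entire request once input exceeds 200K tokens. The token counts are invented.

```python
def opus_46_cost(input_tokens: int, output_tokens: int) -> float:
    """Rough per-request cost estimate under the published tiers:
    $5/$25 per MTok up to 200K input, $10/$37.50 beyond (assumption:
    the premium rate covers the whole request, not just the overflow)."""
    M = 1_000_000
    if input_tokens <= 200_000:
        rate_in, rate_out = 5.00, 25.00
    else:
        rate_in, rate_out = 10.00, 37.50
    return input_tokens / M * rate_in + output_tokens / M * rate_out

# Invented examples: a normal prompt vs. an 800K-token monorepo dump.
print(f"${opus_46_cost(150_000, 8_000):.2f}")  # ~$0.95
print(f"${opus_46_cost(800_000, 8_000):.2f}")  # ~$8.30
```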
Claude Code Agent Teams
This is the feature to watch.
```bash
export CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1
```
Set that environment variable and Claude Code spawns multiple agents that work in parallel. One session acts as team lead, delegates tasks to sub-agents, and coordinates merges via separate git worktrees. Navigate between sub-agents with Shift+Up/Down or via tmux. Rakuten deployed Agent Teams and had it autonomously manage work across six repositories, closing 13 issues in a single day. That's not "AI helps you code faster"—that's "AI runs a small dev team."
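Claude Code manages the worktree plumbing itself, but the underlying mechanism is plain git. Here's a minimal sketch, in Python, of what "one isolated worktree and branch per sub-agent" looks like; the repo path and task names are hypothetical.

```python
import subprocess
from pathlib import Path

REPO = Path("~/code/my-app").expanduser()  # hypothetical repo

def spawn_worktree(task_id: str) -> Path:
    """Create an isolated worktree + branch for one sub-agent's task.

    Each sub-agent gets its own working directory and branch, so parallel
    edits never collide; the team lead later merges or discards each branch.
    """
    branch = f"agent/{task_id}"
    worktree = REPO.parent / f"my-app-{task_id}"
    subprocess.run(
        ["git", "-C", str(REPO), "worktree", "add", "-b", branch, str(worktree)],
        check=True,
    )
    return worktree

# Hypothetical: one worktree per delegated task.
for task in ["fix-login-bug", "add-retry-logic", "bump-deps"]:
    print("sub-agent workspace:", spawn_worktree(task))
```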
Opus 4.6 vs Sonnet 5: What About Sonnet?
People are searching for this, so let me be direct: there is no Sonnet 5 yet. The Fennec leak suggested claude-sonnet-5@20260203 in Vertex AI error logs, but February 3 came and went. What Anthropic shipped instead was Opus 4.6. For most devs, Sonnet 4.5 at $3/$15 is still the sweet spot. You're paying 40% less for roughly 93% of the coding capability. Opus 4.6 justifies its premium when you need Agent Teams or 1M context.
Qwen3-Coder-Next: The Open-Source Disruptor
Released February 3, 2026. Apache 2.0 license. The numbers that should make every API-dependent dev pay attention:
Qwen3-Coder-Next SWE-Bench Scores
| Benchmark | Qwen3-Coder-Next | GLM-4.7 | DeepSeek V3.2 |
|---|---|---|---|
| SWE-bench Verified | 70.6% | 74.2% | 70.2% |
| SWE-bench Multilingual | 62.8% | — | — |
| SWE-bench Pro | 44.3% | — | — |
| Terminal-Bench 2.0 | 36.2% | — | — |
| SecCodeBench | 61.2% | — | 52.5% (Opus 4.5) |
70.6% on SWE-bench Verified with 3B active parameters. DeepSeek V3.2 activates 37B params to score 70.2%. That's roughly 12x the active parameters for a score 0.4 points lower. The SWE-bench Pro score deserves attention: 44.3%. Earlier Pro evaluations had frontier closed-source models at 15-23% on similar enterprise-grade tasks. Different scaffold and test set, but Qwen3-Coder-Next is competitive with models that cost $5-25 per million tokens.
Security benchmark: On SecCodeBench, Qwen3-Coder-Next scores 61.2% on secure code generation. Claude Opus 4.5 scores 52.5%. An open-source model beating the most expensive closed-source model on code security by 8.7 percentage points.
Architecture & Hardware
80B total parameters. Only 3B activated per token via ultra-sparse Mixture-of-Experts.
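For intuition on what "ultra-sparse" buys you: a MoE layer routes each token to a handful of experts and skips the rest, so only a fraction of the weights are touched per token. The sketch below is a generic top-k router in NumPy, not Qwen's implementation; the sizes are illustrative.

```python
import numpy as np

def moe_layer(x, experts_w, router_w, top_k=8):
    """Toy sparse-MoE forward pass for a single token.

    x:         (d,) token activation
    experts_w: (num_experts, d, d) one weight matrix per expert
    router_w:  (d, num_experts) router that scores experts per token

    Only `top_k` experts run, so the active parameter count is roughly
    top_k/num_experts of the layer's total -- the same trick that lets an
    80B-total model activate only a few billion parameters per token.
    """
    scores = x @ router_w                      # (num_experts,)
    top = np.argsort(scores)[-top_k:]          # indices of the chosen experts
    weights = np.exp(scores[top])
    weights /= weights.sum()                   # softmax over the chosen few
    out = np.zeros_like(x)
    for w, idx in zip(weights, top):
        out += w * (experts_w[idx] @ x)        # only top_k matmuls execute
    return out

# Illustrative sizes only: 64 experts, 8 active -> 1/8 of expert params used.
d, n_experts = 128, 64
x = np.random.randn(d)
experts = np.random.randn(n_experts, d, d) * 0.02
router = np.random.randn(d, n_experts) * 0.02
print(moe_layer(x, experts, router).shape)  # (128,)
```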
- Hybrid attention: Combines Gated DeltaNet (linear attention, O(n)) with traditional attention.
- Agentic training: Trained through 800,000 verifiable coding tasks via MegaFlow. Live container feedback during training.
- Non-thinking mode: No `<think></think>` blocks. Direct response generation.
Hardware Requirements (Run it locally):
| Quantization | Memory Required | Speed | Hardware |
|---|---|---|---|
| FP8 (native) | ~80GB | ~43 tok/s | NVIDIA DGX Spark |
| Q8 | ~85GB | Good | Mac Studio M3 Ultra 192GB |
| Q4 | ~46GB | Usable | Mac Studio M3 Ultra, dual RTX 4090 |
| CPU offload | 8GB VRAM + 32GB RAM | ~12 tok/s | Consumer GPU + RAM |
The Q4 variant at ~46GB fits on hardware most serious devs already own or can afford. If you're running a Mac Studio for local AI, this changes the math. For OpenClaw users: Qwen3-Coder-Next works as a drop-in local model via Ollama. Zero API costs. Full privacy.
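A minimal way to try it locally, assuming Ollama is serving its OpenAI-compatible endpoint on the default port; the model tag `qwen3-coder-next` is a placeholder, so check the Ollama model library for the exact name and quantization.

```python
# Minimal local-inference sketch against Ollama's OpenAI-compatible API.
# Assumes `ollama serve` is running and the model has been pulled; the tag
# "qwen3-coder-next" is a placeholder, not a confirmed model name.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="qwen3-coder-next",  # placeholder tag
    messages=[
        {"role": "user", "content": "Write a Python function that retries an HTTP GET with exponential backoff."},
    ],
)
print(resp.choices[0].message.content)
```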
The Complete SWE-bench Leaderboard (February 2026)
Five credible leaderboards. Five different scaffolds. Five different stories.
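Quick terminology check before the numbers, because it explains why they diverge: the benchmark fixes the repo snapshot, the issue, and the hidden tests, while the scaffold is everything around the model (prompting, retrieval, tool use, retries) that produces a candidate patch. Grading is roughly the sketch below; the paths and test command are illustrative.

```python
import subprocess

def grade_candidate(repo_dir: str, patch_file: str, test_cmd: list[str]) -> bool:
    """Roughly what a SWE-bench-style harness does with one candidate patch:
    apply it to the pinned repo snapshot, run the task's tests, record pass/fail.
    Everything upstream of `patch_file` -- prompting, context retrieval, agent
    loops, retries -- is the scaffold, and that's where the leaderboards diverge.
    """
    applied = subprocess.run(
        ["git", "-C", repo_dir, "apply", patch_file], capture_output=True
    )
    if applied.returncode != 0:
        return False  # patch doesn't even apply
    tests = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return tests.returncode == 0

# Illustrative call; the real benchmark runs this inside a per-task container.
# grade_candidate("./repo-snapshot", "candidate.patch", ["pytest", "-x", "tests/"])
```

Change the scaffold and the same model produces different patches from the same tasks, which is how one benchmark ends up with five leaderboards.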
1. Self-Reported (Lab's Own Scaffolds)
These are the marketing numbers. Each lab optimizes their scaffold for their own model.
- Claude Opus 4.6: 80.8%
- GPT-5.1 Codex Max: 77.9%
- Gemini 3 Pro: 76.2%
- Qwen3-Coder-Next: 70.6%
2. Standardized (vals.ai — Same SWE-Agent for All)
Same scaffold for every model. The gap shrinks.
- Gemini 3 Flash (12/25): 76.2% (Budget Tier)
- GPT 5.2: 75.4%
- Claude Opus 4.5: 74.6%
Gemini 3 Flash. A Flash-tier model. #1 on the standardized leaderboard at $0.15/$0.60 per MTok. That's 33x cheaper than Claude Opus on input.
3. SWE-rebench (Monthly Fresh Tasks)
The contamination-resistant benchmark. Tasks from real GitHub repos created after the model's training cutoff.
- Gemini 3 Flash: 57.6%
- Claude Sonnet 4.5: ~54% (Highest Pass@5: 55.1%)
- GLM-4.7: ~52%
From 80% on Verified down to ~55% on fresh tasks. That gap is the contamination/difficulty question nobody wants to talk about. Claude Sonnet 4.5 uniquely solved problems here that no other model managed.
4. SWE-bench Pro (Enterprise Reality Check)
1,865 tasks across 41 professional repos, including private codebases.
- GPT-5: 23.3% (Public) / 14.9% (Private)
- Claude Opus 4.1: 23.1% (Public) / 17.8% (Private)
- Qwen3-Coder-Next: 44.3% (Different scaffold)
From 80% to 23%. This is the reality gap. On code they've never seen, frontier models struggle.
The Contamination Problem
IBM researchers said it directly—the Python SWE-bench leaderboard is "kind of saturated" with "mounting evidence that the latest frontier models are basically contaminated." The evidence is in the numbers: 20+ percentage point gap between Verified and fresh tasks. SWE-bench Pro's private codebase subset is the closest thing we have to a contamination-free coding eval. And there, the best model scores ~17.8% on private code.
The Real Cost Breakdown
| Model | Input/MTok | Output/MTok | SWE-bench | Best For |
|---|---|---|---|---|
| Claude Opus 4.6 | $5 | $25 | 80.8% | Agent teams, 1M context |
| Claude Sonnet 4.5 | $3 | $15 | ~75% | Daily coding, value |
| GPT-5.3 Codex | Premium | Premium | — | Terminal/CLI workflows |
| Gemini 3 Flash | ~$0.15 | ~$0.60 | 76.2% | Volume, budget, speed |
| Qwen3-Coder-Next | Self-host | Self-host | 70.6% | Privacy, local, zero cost |
The math (dollar figures in the sketch after this list):
- Gemini 3 Flash: 33x cheaper than Opus input.
- Claude Sonnet 4.5: 40% cheaper than Opus. Unique problem solving.
- Qwen3-Coder-Next: Zero per-token cost. One-time hardware investment.
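To turn those ratios into dollars, here's a back-of-the-envelope sketch for one hypothetical month of usage. The token volume is invented, the prices are the list rates from the table above, and the self-hosted row ignores hardware amortization and electricity.

```python
# Back-of-the-envelope monthly API spend for a hypothetical workload:
# 200M input tokens and 40M output tokens per month (made-up volume).
PRICES = {  # $ per million tokens (input, output), list rates from the table
    "Claude Opus 4.6":   (5.00, 25.00),
    "Claude Sonnet 4.5": (3.00, 15.00),
    "Gemini 3 Flash":    (0.15, 0.60),
    "Qwen3-Coder-Next":  (0.00, 0.00),  # self-hosted; hardware not counted here
}

IN_M, OUT_M = 200, 40  # millions of tokens per month (assumption)

for model, (p_in, p_out) in PRICES.items():
    monthly = IN_M * p_in + OUT_M * p_out
    print(f"{model:<18} ${monthly:>9,.2f}/month")

# Opus 4.6: $2,000   Sonnet 4.5: $1,200   Flash: $54   Qwen: $0 per token
```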
The Verdict
There is no single "best AI for coding" in February 2026. There are five leaderboards telling five different stories.
For most developers: Claude Sonnet 4.5. Strong performance, reasonable price, uniquely solved problems on fresh benchmarks that no other model could.
For agentic coding teams:
Claude Opus 4.6. Agent Teams (CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1) and 1M context justify the premium.
For terminal-heavy workflows: GPT-5.3 Codex. Its 77.3% on Terminal-Bench 2.0 leads every frontier model. If you live in the CLI, this is it.
For budget/volume: Gemini 3 Flash. Tops the standardized leaderboard while costing 33x less than Opus.
For local/private coding: Qwen3-Coder-Next. 70.6% SWE-bench Verified with 3B active params. Runs on a Mac Studio.
Stop trusting single benchmark numbers in marketing announcements. Look at standardized evals. Check SWE-bench Pro for reality. The 5% at the top of any leaderboard matters less than whether the model actually works for your code.