Marco Patzelt

SWE-Bench Verified Leaderboard April 2026

Current AI model rankings and latest top scores across SWE-Bench Verified, SWE-Bench Pro, Terminal-Bench 2.0 & Aider Polyglot — updated April 2026.

| # | Model | Provider | Score | Notes |
|---|-------|----------|-------|-------|
| 1 | Claude Opus 4.7 | Anthropic | 87.6% | New · 1M context |
| 2 | GPT-5.3-Codex | OpenAI | 85.0% | |
| 3 | Claude Opus 4.5 | Anthropic | 80.9% | |
| 4 | Claude Opus 4.6 | Anthropic | 80.8% | |
| 5 | Gemini 3.1 Pro | Google | 80.6% | |
| 6 | MiniMax M2.5 | MiniMax | 80.2% | Open-weight |
| 7 | GPT-5.2 | OpenAI | 80.0% | |
| 8 | Claude Sonnet 4.6 | Anthropic | 79.6% | |
| 9 | Qwen3.6 Plus | Alibaba | 78.8% | New |
| 10 | Gemini 3 Flash | Google | 78.0% | |
| 11 | MiMo-V2-Pro | Xiaomi | 78.0% | 1T params · Open-source |
| 12 | GLM-5 | Zhipu AI | 77.8% | 744B params · Open-source |
| 13 | Muse Spark | Meta | 77.4% | New · MSL flagship |
| 14 | Claude Sonnet 4.5 | Anthropic | 77.2% | |
| 15 | Kimi K2.5 | Moonshot AI | 76.8% | Open-source |
| 16 | Gemini 3 Pro | Google | 76.2% | |
| 17 | GPT-5.1 | OpenAI | 74.9% | |
| 18 | MiMo-V2-Omni | Xiaomi | 74.8% | Open-source |
| 19 | GLM-4.7 | Zhipu AI | 73.8% | Open-source |
| 20 | Grok 4 | xAI | 73.5% | Self-reported 72-75% |
| 21 | Claude Haiku 4.5 | Anthropic | 73.3% | |
| 22 | DeepSeek V3.2 | DeepSeek | 73.0% | Open-source |
| 23 | Claude Sonnet 4 | Anthropic | 72.7% | Scaffold-dependent* |
| 24 | Qwen3-Coder-Next | Alibaba | 70.6% | 3B active params · Open |
| 25 | Gemini 2.5 Pro | Google | 63.8% | |
| 26 | GPT-OSS-120B | OpenAI | 62.4% | Open-source |
| 27 | Grok Code Fast | xAI | 57.6% | |
| 28 | GPT-4.1 | OpenAI | 54.6% | |
| 29 | o3 | OpenAI | 49.8% | |

Source: swebench.com

Scores are self-reported by model providers unless noted. Scaffold/harness differences affect results.

4 Benchmarks · 100 Model Entries · Updated April 2026

SWE-Bench Verified Leaderboard: Claude Opus 4.7 Takes #1

Claude Opus 4.7 from Anthropic now leads SWE-Bench Verified at 87.6% following its April 16, 2026 release with 1M context. GPT-5.3-Codex follows at 85.0%. Claude Opus 4.5 sits at 80.9%, Opus 4.6 at 80.8%, Gemini 3.1 Pro at 80.6%, MiniMax M2.5 at 80.2%, and GPT-5.2 at 80.0%. Qwen3.6 Plus from Alibaba (April 2026, 78.8%) and Muse Spark from Meta (77.4%) round out the major April entrants.
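
To put those percentages in absolute terms: SWE-Bench Verified contains 500 human-validated GitHub issues, and a score is simply the share that a model's patches resolve. A minimal sketch of the conversion, using the two top scores from the table:

```python
# SWE-Bench Verified has 500 human-validated GitHub issues; a model's
# score is the percentage whose generated patch passes the repo's tests.
TOTAL_INSTANCES = 500

def resolved_count(score_pct: float) -> int:
    """Convert a leaderboard percentage into an approximate resolved-issue count."""
    return round(TOTAL_INSTANCES * score_pct / 100)

for model, score in [("Claude Opus 4.7", 87.6), ("GPT-5.3-Codex", 85.0)]:
    print(f"{model}: ~{resolved_count(score)} / {TOTAL_INSTANCES} issues resolved")
# Claude Opus 4.7: ~438 / 500 issues resolved
# GPT-5.3-Codex: ~425 / 500 issues resolved
```

So the 2.6-point lead at the top of the board corresponds to roughly 13 additional resolved issues.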

Terminal-Bench 2.0: ForgeCode Scaffold Tops the Board

ForgeCode with Claude Opus 4.6 and ForgeCode with GPT-5.4 are tied for #1 on Terminal-Bench 2.0 at 81.8%. TongAgents with Gemini 3.1 Pro reaches 80.2%. SageAgent + GPT-5.3-Codex and ForgeCode + Gemini 3.1 Pro both hit 78.4%. Factory.ai's Droid scaffold with GPT-5.3-Codex follows at 77.3%. Anthropic self-reports Claude Opus 4.7 at 69.4%, pending submission to tbench.ai.

SWE-Bench Pro: Claude Opus 4.7 Leads at 64.3%

On SWE-Bench Pro, Claude Opus 4.7 leads at 64.3% (Anthropic-reported, April 2026). GPT-5.4 (xHigh) reaches 59.1% on Scale's SEAL mini-swe-agent scaffold. GPT-5.3-Codex (agent system) scores 56.8%, GPT-5.2-Codex 56.4%, and Muse Spark from Meta 55.0%. Claude Opus 4.6 scores 51.9% on the SEAL mini-swe-agent harness. Scale's fully standardized SEAL board puts Claude Opus 4.5 in the lead at 45.9%.

Open-Source Models on SWE-Bench 2026

MiniMax M2.5 leads open-weight models on SWE-Bench Verified at 80.2%, still in the top 10 overall. MiMo-V2-Pro from Xiaomi reaches 78.0% with 1T parameters. GLM-5 from Zhipu AI follows at 77.8% with 744B parameters trained on Huawei chips. Kimi K2.5 from Moonshot AI scores 76.8%. GLM-4.7 reaches 73.8% (corrected upward from earlier reports). DeepSeek V3.2 hits 73.0%, and Qwen3-Coder-Next achieves 70.6% with only 3B active parameters.

Best AI Coding Model April 2026

Claude Opus 4.7 is the clear overall leader in April 2026 — 87.6% on SWE-Bench Verified and 64.3% on SWE-Bench Pro, both #1. GPT-5.3-Codex follows at 85.0% on SWE-Bench Verified. Claude Sonnet 4.6 punches above its weight at 79.6% — still only 1.2 points behind Opus 4.6 and 5x cheaper.

For terminal and DevOps workflows, ForgeCode scaffolds with Claude Opus 4.6 or GPT-5.4 top Terminal-Bench 2.0 at 81.8%. TongAgents + Gemini 3.1 Pro reaches 80.2%. On multi-language editing (Aider Polyglot), Claude Opus 4.5 leads at 89.4% (Anthropic-reported), with GPT-5 (high) at 88.0%.

Budget-conscious? DeepSeek V3.2-Exp delivers 74.2% on Aider Polyglot at $1.30/run — 22x cheaper than GPT-5. Qwen3-Coder-Next scores 70.6% on SWE-Bench Verified using only 3B active parameters, the most efficient model in the top 25.
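
For readers weighing that trade-off, here is a back-of-envelope sketch. The run costs and scores are the figures quoted above; the GPT-5 run cost is back-derived from the stated 22x ratio, and "points per dollar" is just an illustrative value metric, not an official benchmark statistic:

```python
# Cost-efficiency math behind the comparison above (figures as quoted).
deepseek_run_cost = 1.30                  # USD per Aider Polyglot run (reported)
gpt5_run_cost = deepseek_run_cost * 22    # implied by the "22x cheaper" claim

deepseek_score = 74.2                     # Aider Polyglot, % (DeepSeek V3.2-Exp)
gpt5_score = 88.0                         # Aider Polyglot, % (GPT-5 high)

print(f"Implied GPT-5 run cost: ${gpt5_run_cost:.2f}")                   # $28.60
print(f"DeepSeek value: {deepseek_score / deepseek_run_cost:.1f} pts/$")  # 57.1
print(f"GPT-5 value:    {gpt5_score / gpt5_run_cost:.1f} pts/$")          # 3.1
```

On that crude metric, DeepSeek delivers roughly 18x more benchmark points per dollar, at the cost of about 14 score points.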

Frequently Asked Questions

Which models currently top SWE-Bench Verified?

Claude Opus 4.7 from Anthropic leads SWE-Bench Verified at 87.6%, released April 16, 2026. GPT-5.3-Codex from OpenAI follows at 85.0%. Next come Claude Opus 4.5 at 80.9%, Claude Opus 4.6 at 80.8%, Gemini 3.1 Pro at 80.6%, MiniMax M2.5 at 80.2%, and GPT-5.2 at 80.0%. Claude Sonnet 4.6 scores 79.6%. Qwen3.6 Plus reaches 78.8% as Alibaba's new flagship.

How does Claude Opus 4.7 perform across benchmarks?

Claude Opus 4.7 scores 87.6% on SWE-Bench Verified, 64.3% on SWE-Bench Pro, and Anthropic reports 69.4% on Terminal-Bench 2.0 (not yet on the public tbench.ai board). Released April 16, 2026 with a 1M-token context window, it now leads all publicly available models on SWE-Bench Verified and SWE-Bench Pro.

How does Claude Opus 4.6 score on coding benchmarks?

Claude Opus 4.6 scores 80.8% on SWE-Bench Verified and 51.9% on SWE-Bench Pro (Scale SEAL mini-swe-agent). On Terminal-Bench 2.0, Opus 4.6 reaches 74.7% with the Terminus-KIRA scaffold, and ForgeCode + Opus 4.6 tops the board at 81.8%. Released January 2026.

How does Gemini 3.1 Pro perform on coding benchmarks?

Gemini 3.1 Pro from Google DeepMind scores 80.6% on SWE-Bench Verified as of February 2026. On Terminal-Bench 2.0, TongAgents + Gemini 3.1 Pro reaches 80.2% and ForgeCode + Gemini 3.1 Pro reaches 78.4%. On SWE-Bench Pro (Scale SEAL mini-swe-agent), it scores 46.1%.

What does Grok 4 score on SWE-Bench Verified?

xAI self-reports 72-75% for Grok 4 on SWE-Bench Verified. Independent testing by vals.ai with the SWE-agent scaffold shows 58.6%, a significant gap that highlights how much scaffold choice affects results. On Aider Polyglot, Grok 4 scores 79.6%. xAI has since released Grok 4.20, its current flagship.

Is Claude Sonnet 4.6 good value for coding?

Claude Sonnet 4.6 scores 79.6% on SWE-Bench Verified, only 1.2 points behind Opus 4.6 and 2.4 points ahead of Sonnet 4.5. At $3/$15 per million tokens, five times cheaper than Opus, it offers strong cost-efficiency for coding tasks.
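
A quick sketch of what that price gap means in practice. The $3/$15 Sonnet pricing is stated above; the Opus figures are implied by the "five times cheaper" claim, and the session token counts are hypothetical, chosen only for illustration:

```python
# Per-session cost comparison (Sonnet prices as stated; Opus implied at 5x;
# token counts are hypothetical for a single agentic coding session).
sonnet_in, sonnet_out = 3.0, 15.0                  # USD per 1M tokens
opus_in, opus_out = sonnet_in * 5, sonnet_out * 5  # implied: $15 / $75

def session_cost(price_in, price_out, in_tok=2_000_000, out_tok=300_000):
    return price_in * in_tok / 1e6 + price_out * out_tok / 1e6

print(f"Claude Sonnet 4.6: ${session_cost(sonnet_in, sonnet_out):.2f}")  # $10.50
print(f"Claude Opus 4.6:   ${session_cost(opus_in, opus_out):.2f}")      # $52.50
```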

Which model leads SWE-Bench Pro?

Claude Opus 4.7 leads SWE-Bench Pro at 64.3% (Anthropic-reported, April 2026). GPT-5.4 (xHigh) scores 59.1% on Scale SEAL mini-swe-agent. Agent-system scores: GPT-5.3-Codex CLI at 56.8%, GPT-5.2-Codex at 56.4%, GPT-5.2 at 55.6%. Muse Spark from Meta reaches 55.0%. On Scale SEAL standardized scaffolding, Claude Opus 4.5 leads at 45.9%.

What is the best AI coding model right now?

Claude Opus 4.7 leads SWE-Bench Verified (87.6%) and SWE-Bench Pro (64.3%) as of April 2026, making it the new overall leader for coding. GPT-5.3-Codex reaches 85.0% on SWE-Bench Verified. ForgeCode scaffolds with Opus 4.6 or GPT-5.4 top Terminal-Bench 2.0 at 81.8%. On Aider Polyglot, Claude Opus 4.5 leads at 89.4% (Anthropic-reported). DeepSeek V3.2-Exp offers the best cost-efficiency at $1.30 per run.

How much have SWE-Bench Verified scores improved?

The top score jumped from around 65% in early 2025 to 87.6% in April 2026 with Claude Opus 4.7. Anthropic holds the #1 spot; GPT-5.3-Codex at 85.0% is #2. Gemini 3.1 Pro sits at 80.6%, MiniMax M2.5 at 80.2% as an open-weight model, GPT-5.2 at 80.0%, and Claude Sonnet 4.6 at 79.6%. New entrants include Qwen3.6 Plus (78.8%), Muse Spark from Meta (77.4%), and MiMo-V2-Pro from Xiaomi (78.0%). Agent frameworks outperform raw model scores by 5-15 points.

How often is this leaderboard updated?

This leaderboard is updated monthly with the latest benchmark scores from SWE-Bench Verified, Terminal-Bench 2.0, Aider Polyglot, and SWE-Bench Pro. Scores are self-reported by model providers unless noted. Scaffold/harness differences affect results.

Which model is best for which task?

For pure code generation, Claude Opus 4.7 leads SWE-Bench Verified at 87.6% and SWE-Bench Pro at 64.3%. For terminal and DevOps workflows, ForgeCode + Opus 4.6 or GPT-5.4 tops Terminal-Bench 2.0 at 81.8%. For multi-language editing, Claude Opus 4.5 leads Aider Polyglot at 89.4%. For cost-efficiency, DeepSeek V3.2-Exp delivers 74.2% on Aider Polyglot at just $1.30 per run.

What are the best open-source coding models?

MiniMax M2.5 leads open-weight models on SWE-Bench Verified at 80.2%, still in the top 10 overall. MiMo-V2-Pro from Xiaomi reaches 78.0% with 1T parameters. GLM-5 from Zhipu AI follows at 77.8% with 744B parameters. Kimi K2.5 scores 76.8%. GLM-4.7 reaches 73.8% (corrected from earlier reports). DeepSeek V3.2 hits 73.0%, and Qwen3-Coder-Next achieves 70.6% with only 3B active parameters.

Which agent setup leads Terminal-Bench 2.0?

ForgeCode + Claude Opus 4.6 and ForgeCode + GPT-5.4 are tied at 81.8% on Terminal-Bench 2.0 as of April 2026. TongAgents + Gemini 3.1 Pro reaches 80.2%. SageAgent + GPT-5.3-Codex and ForgeCode + Gemini 3.1 Pro both hit 78.4%. Droid + GPT-5.3-Codex from Factory scores 77.3%. Anthropic reports Claude Opus 4.7 at 69.4% (not yet on the public tbench.ai board).

Read the full analysis →

Let's connect.

I build middleware by day and autonomous agent systems by night. If you're working on something serious in agentic infrastructure, I'd like to hear about it.

Email me