Marco Patzelt

SWE-Bench Verified Leaderboard May 2026

Current AI model rankings and latest top scores across SWE-Bench Verified, SWE-Bench Pro, Terminal-Bench 2.0 & Aider Polyglot — updated May 2026.

| # | Model | Provider | Score | Notes |
|---|-------|----------|-------|-------|
| 1 | GPT-5.5 | OpenAI | 88.7% | New · OpenAI-reported |
| 2 | Claude Opus 4.7 | Anthropic | 87.6% | 1M context |
| 3 | GPT-5.3-Codex | OpenAI | 85.0% | |
| 4 | Claude Opus 4.5 | Anthropic | 80.9% | |
| 5 | Claude Opus 4.6 | Anthropic | 80.8% | |
| 6 | DeepSeek V4 Pro Max | DeepSeek | 80.6% | New · 1.6T MoE · Open-source |
| 7 | Gemini 3.1 Pro | Google | 80.6% | |
| 8 | Kimi K2.6 | Moonshot AI | 80.2% | New · 1T MoE · Open-weight |
| 9 | MiniMax M2.5 | MiniMax | 80.2% | Open-weight |
| 10 | GPT-5.2 | OpenAI | 80.0% | |
| 11 | Claude Sonnet 4.6 | Anthropic | 79.6% | |
| 12 | Qwen3.6 Plus | Alibaba | 78.8% | |
| 13 | Gemini 3 Flash | Google | 78.0% | |
| 14 | MiMo-V2-Pro | Xiaomi | 78.0% | 1T params · Open-source |
| 15 | GLM-5 | Zhipu AI | 77.8% | 744B params · Open-source |
| 16 | Mistral Medium 3.5 128B | Mistral | 77.6% | New · Open |
| 17 | Muse Spark | Meta | 77.4% | MSL flagship |
| 18 | Qwen3.6-27B | Alibaba | 77.2% | New · Dense · Apache 2.0 |
| 19 | Claude Sonnet 4.5 | Anthropic | 77.2% | |
| 20 | Kimi K2.5 | Moonshot AI | 76.8% | Open-source |
| 21 | Gemini 3 Pro | Google | 76.2% | |
| 22 | GPT-5.1 | OpenAI | 74.9% | |
| 23 | MiMo-V2-Omni | Xiaomi | 74.8% | Open-source |
| 24 | GLM-4.7 | Zhipu AI | 73.8% | Open-source |
| 25 | Grok 4 | xAI | 73.5% | Self-reported 72-75% |
| 26 | Qwen3.6-35B-A3B | Alibaba | 73.4% | New · MoE · Open |
| 27 | Claude Haiku 4.5 | Anthropic | 73.3% | |
| 28 | DeepSeek V3.2 | DeepSeek | 73.0% | Open-source |
| 29 | Claude Sonnet 4 | Anthropic | 72.7% | Scaffold-dependent* |
| 30 | Qwen3-Coder-Next | Alibaba | 70.6% | 3B active params · Open |
| 31 | Gemini 2.5 Pro | Google | 63.8% | |
| 32 | GPT-OSS-120B | OpenAI | 62.4% | Open-source |
| 33 | Grok Code Fast | xAI | 57.6% | |
| 34 | GPT-4.1 | OpenAI | 54.6% | |
| 35 | o3 | OpenAI | 49.8% | |

Source: swebench.com

Scores are self-reported by model providers unless noted. Scaffold/harness differences affect results.

4 Benchmarks · 109 Model Entries · Updated May 2026

SWE-Bench Verified Leaderboard: GPT-5.5 Takes #1

GPT-5.5 from OpenAI takes the new #1 spot on SWE-Bench Verified at 88.7% (OpenAI-reported, released April 23, 2026). Claude Opus 4.7 drops to 87.6% at #2 (April 16, 2026, 1M context), and GPT-5.3-Codex holds #3 at 85.0%. Below 81% the field is tight: Opus 4.5 (80.9%), Opus 4.6 (80.8%), DeepSeek V4 Pro Max (80.6%, new open-weight 1.6T MoE) and Gemini 3.1 Pro (80.6%) tie, and Kimi K2.6 (80.2%, new open-weight) ties MiniMax M2.5. Other April entrants: Mistral Medium 3.5 (77.6%), Muse Spark (77.4%), and Qwen3.6-27B (77.2%).
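
As a refresher on what these percentages mean: SWE-Bench Verified is a 500-task, human-validated subset of SWE-Bench, and a model's score is simply the fraction of tasks whose hidden tests pass after the model's patch is applied. A minimal sketch of the arithmetic (the per-task outcomes below are illustrative, not real leaderboard data):

```python
# Sketch: a SWE-Bench-style score is resolved tasks / total tasks.
# SWE-Bench Verified has 500 human-validated tasks; the outcomes
# below are made up for illustration.

def swebench_score(resolved_flags):
    """Return the percentage of tasks resolved, rounded to one decimal."""
    return round(100 * sum(resolved_flags) / len(resolved_flags), 1)

# Hypothetical run: 443 of 500 tasks resolved.
results = [True] * 443 + [False] * 57
print(swebench_score(results))  # 88.6
```

One decimal point on a 500-task benchmark corresponds to a single resolved task, which is why sub-point gaps between neighboring entries are within noise.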

Terminal-Bench 2.0: Codex CLI + GPT-5.5 Takes the Lead

Codex CLI + GPT-5.5 hits 82.0% on Terminal-Bench 2.0 (April 23, 2026), the new outright #1. ForgeCode + GPT-5.4 holds 81.8%. TongAgents + Gemini 3.1 Pro reaches 80.2%. ForgeCode + Claude Opus 4.6 was revised down to 79.8% on the latest tbench run. SageAgent + GPT-5.3-Codex and ForgeCode + Gemini 3.1 Pro both sit at 78.4%. Factory.ai's Droid + GPT-5.3-Codex follows at 77.3%. Anthropic self-reports Claude Opus 4.7 at 69.4%, still pending tbench.ai submission.

SWE-Bench Pro: Claude Opus 4.7 Leads at 64.3%

On SWE-Bench Pro, Claude Opus 4.7 leads at 64.3% (Anthropic-reported, April 2026 release). GPT-5.4 (xHigh) reaches 59.1% on Scale's SEAL mini-swe-agent scaffold. GPT-5.3-Codex (agent system) scores 56.8%, GPT-5.2-Codex 56.4%, and Muse Spark from Meta 55.0%. Claude Opus 4.6 scores 51.9% on the SEAL mini-swe-agent harness. Scale's fully standardized SEAL board puts Claude Opus 4.5 in the lead at 45.9%.

Open-Source Models on SWE-Bench 2026

DeepSeek V4 Pro Max leads open-weight models on SWE-Bench Verified at 80.6% (1.6T MoE), with Kimi K2.6 and MiniMax M2.5 tied at 80.2%, all three in the top 10 overall. MiMo-V2-Pro from Xiaomi reaches 78.0% with 1T parameters. GLM-5 from Zhipu AI follows at 77.8% with 744B parameters trained on Huawei chips. Kimi K2.5 from Moonshot AI scores 76.8%. GLM-4.7 reaches 73.8% (corrected upward from earlier reports). DeepSeek V3.2 hits 73.0%, and Qwen3-Coder-Next achieves 70.6% with only 3B active parameters.

Best AI Coding Model May 2026

GPT-5.5 holds the top SWE-Bench Verified score at 88.7%, but Claude Opus 4.7 is the strongest all-round coding model in May 2026: 87.6% on SWE-Bench Verified (#2) and 64.3% on SWE-Bench Pro (#1). GPT-5.3-Codex follows at 85.0% on SWE-Bench Verified. Claude Sonnet 4.6 punches above its weight at 79.6%, only 1.2 points behind Opus 4.6 and 5x cheaper.

For terminal and DevOps workflows, Codex CLI + GPT-5.5 tops Terminal-Bench 2.0 at 82.0%, with ForgeCode + GPT-5.4 close behind at 81.8% and TongAgents + Gemini 3.1 Pro at 80.2%. On multi-language editing (Aider Polyglot), Claude Opus 4.5 leads at 89.4% (Anthropic-reported), with GPT-5 (high) at 88.0%.

Budget-conscious? DeepSeek V3.2-Exp delivers 74.2% on Aider Polyglot at $1.30/run — 22x cheaper than GPT-5. Qwen3-Coder-Next scores 70.6% on SWE-Bench Verified using only 3B active parameters, the most efficient model in the top 25.
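
The "22x cheaper" figure is easy to sanity-check: at $1.30 per Aider Polyglot run, the implied GPT-5 cost is $1.30 × 22. A quick check using only the numbers quoted above:

```python
# Sanity-check the cost claim from the text: DeepSeek V3.2-Exp at
# $1.30 per Aider Polyglot run, stated as 22x cheaper than GPT-5.
deepseek_run_cost = 1.30   # $ per run, from the text
cost_ratio = 22            # "22x cheaper than GPT-5"

implied_gpt5_cost = deepseek_run_cost * cost_ratio
print(f"${implied_gpt5_cost:.2f}")  # $28.60
```

So a single GPT-5 Polyglot run lands near $29 under the quoted ratio, which is where the budget argument comes from.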

Frequently Asked Questions

Which AI model is #1 on SWE-Bench Verified in May 2026?

GPT-5.5 from OpenAI takes the new #1 spot at 88.7% (OpenAI-reported, released April 23, 2026). Claude Opus 4.7 from Anthropic drops to #2 at 87.6% (April 16, 2026, 1M context). GPT-5.3-Codex follows at 85.0%. Next: Claude Opus 4.5 at 80.9%, Claude Opus 4.6 at 80.8%, DeepSeek V4 Pro Max at 80.6% (new open-weight 1.6T MoE), Gemini 3.1 Pro at 80.6%, Kimi K2.6 at 80.2% (new open-weight), MiniMax M2.5 at 80.2%, and GPT-5.2 at 80.0%.

How good is GPT-5.5 at coding?

GPT-5.5 from OpenAI scores 88.7% on SWE-Bench Verified (OpenAI-reported, released April 23, 2026), making it the new public leader, 1.1 points ahead of Claude Opus 4.7. On Terminal-Bench 2.0, Codex CLI + GPT-5.5 takes #1 at 82.0%. SWE-Bench Pro and Aider Polyglot scores are not yet published.

How good is Claude Opus 4.7 at coding?

Claude Opus 4.7 scores 87.6% on SWE-Bench Verified (now #2 behind GPT-5.5 at 88.7%), 64.3% on SWE-Bench Pro (still #1, Anthropic-reported), and Anthropic reports 69.4% on Terminal-Bench 2.0 (not yet on the public tbench.ai board). Released April 16, 2026 with 1M token context.

How good is Claude Opus 4.6 at coding?

Claude Opus 4.6 scores 80.8% on SWE-Bench Verified and 51.9% on SWE-Bench Pro (Scale SEAL mini-swe-agent). On Terminal-Bench 2.0, Opus 4.6 reaches 74.7% (Terminus-KIRA scaffold), and ForgeCode + Opus 4.6 scores 79.8% (revised down from 81.8% on the latest tbench run). Released January 2026.

How good is Gemini 3.1 Pro at coding?

Gemini 3.1 Pro from Google DeepMind scores 80.6% on SWE-Bench Verified as of February 2026. On Terminal-Bench 2.0, TongAgents + Gemini 3.1 Pro reaches 80.2% and ForgeCode + Gemini 3.1 Pro reaches 78.4%. On SWE-Bench Pro (Scale SEAL mini-swe-agent), it scores 46.1%.

What does Grok 4 score on SWE-Bench Verified?

xAI self-reports 72-75% for Grok 4 on SWE-Bench Verified. Independent testing by vals.ai with the SWE-agent scaffold shows 58.6%, a significant gap that highlights how scaffold choice affects results. On Aider Polyglot, Grok 4 scores 79.6%. xAI has since released Grok 4.20, now its current flagship.

Is Claude Sonnet 4.6 good value for coding?

Claude Sonnet 4.6 scores 79.6% on SWE-Bench Verified, only 1.2 points behind Opus 4.6 and 2.4 points ahead of Sonnet 4.5. At $3/$15 per million tokens, five times cheaper than Opus, it offers strong cost-efficiency for coding tasks.
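
To make the cost gap concrete: with per-million-token pricing, a task's cost is a weighted sum of its input and output token counts. A sketch with illustrative volumes (the token counts for an agentic task are assumptions, not measurements; Opus pricing is taken as 5x Sonnet's $3/$15, per the text):

```python
def task_cost(input_tokens, output_tokens, in_price, out_price):
    """Cost in dollars; prices are per million tokens."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Claude Sonnet 4.6 pricing from the text: $3 / $15 per million tokens.
# Token volumes below are hypothetical for a long agentic coding task.
sonnet = task_cost(400_000, 40_000, 3, 15)
opus = task_cost(400_000, 40_000, 15, 75)   # assumed 5x Sonnet, per the text

print(f"Sonnet: ${sonnet:.2f}, Opus: ${opus:.2f}")  # Sonnet: $1.80, Opus: $9.00
```

At those assumed volumes the 1.2-point score gap buys a 5x difference in spend per task, which is the trade-off the ranking alone doesn't show.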

Which model leads SWE-Bench Pro?

Claude Opus 4.7 leads SWE-Bench Pro at 64.3% (Anthropic-reported, April 2026 release). GPT-5.4 (xHigh) scores 59.1% on Scale SEAL mini-swe-agent. Agent-system scores: GPT-5.3-Codex CLI at 56.8%, GPT-5.2-Codex at 56.4%, GPT-5.2 at 55.6%. Muse Spark from Meta reaches 55.0%. On Scale SEAL standardized scaffolding, Claude Opus 4.5 leads at 45.9%.

What is the best AI coding model in May 2026?

GPT-5.5 leads SWE-Bench Verified at 88.7% (OpenAI-reported) as of May 2026. Claude Opus 4.7 follows at 87.6% and still leads SWE-Bench Pro at 64.3% (Anthropic-reported). GPT-5.3-Codex reaches 85.0% on SWE-Bench Verified. On Terminal-Bench 2.0, Codex CLI + GPT-5.5 takes #1 at 82.0%, and ForgeCode + GPT-5.4 follows at 81.8%. On Aider Polyglot, Claude Opus 4.5 leads at 89.4% (Anthropic-reported). DeepSeek V3.2-Exp offers the best cost-efficiency at $1.30 per run.

How have SWE-Bench Verified scores improved over time?

The top score jumped from around 65% in early 2025 to 88.7% in May 2026 with GPT-5.5 (OpenAI, April 23, 2026). Claude Opus 4.7 is #2 at 87.6%, GPT-5.3-Codex #3 at 85.0%. Gemini 3.1 Pro and DeepSeek V4 Pro Max tie at 80.6%. April 2026 new entrants: GPT-5.5 (88.7%), Claude Opus 4.7 (87.6%), DeepSeek V4 Pro Max (80.6%, open-weight), Kimi K2.6 (80.2%, open-weight), Qwen3.6 Plus (78.8%), Mistral Medium 3.5 (77.6%), Muse Spark (77.4%), Qwen3.6-27B (77.2%). Agent frameworks outperform raw model scores by 5-15 points.

How often is this leaderboard updated?

This leaderboard is updated monthly with the latest benchmark scores from SWE-Bench Verified, Terminal-Bench 2.0, Aider Polyglot, and SWE-Bench Pro. Scores are self-reported by model providers unless noted. Scaffold/harness differences affect results.

Which AI model should I use for coding?

For pure code generation, GPT-5.5 leads SWE-Bench Verified at 88.7%; Claude Opus 4.7 leads SWE-Bench Pro at 64.3%. For terminal and DevOps workflows, Codex CLI + GPT-5.5 tops Terminal-Bench 2.0 at 82.0%. For multi-language editing, Claude Opus 4.5 leads Aider Polyglot at 89.4%. For cost-efficiency, DeepSeek V3.2-Exp delivers 74.2% on Aider Polyglot at just $1.30 per run.

What are the best open-source coding models?

DeepSeek V4 Pro Max leads open-weight models on SWE-Bench Verified at 80.6% (1.6T MoE, April 2026), tied with closed-source Gemini 3.1 Pro. Kimi K2.6 follows at 80.2% (April 2026, 1T MoE), tied with MiniMax M2.5. MiMo-V2-Pro from Xiaomi reaches 78.0% with 1T parameters. GLM-5 from Zhipu AI: 77.8% (744B). Mistral Medium 3.5: 77.6% (April 2026). Qwen3.6-27B: 77.2% (April 2026, dense, Apache 2.0). Kimi K2.5: 76.8%. GLM-4.7: 73.8%. DeepSeek V3.2: 73.0%. Qwen3-Coder-Next: 70.6% with only 3B active parameters.

Which agent and model combination leads Terminal-Bench 2.0?

Codex CLI + GPT-5.5 is the new outright #1 at 82.0% (April 23, 2026, OpenAI). ForgeCode + GPT-5.4 holds 81.8%. TongAgents + Gemini 3.1 Pro reaches 80.2%. ForgeCode + Claude Opus 4.6 was revised to 79.8% on the latest tbench run. SageAgent + GPT-5.3-Codex and ForgeCode + Gemini 3.1 Pro both hit 78.4%. Droid + GPT-5.3-Codex from Factory.ai scores 77.3%. Anthropic reports Claude Opus 4.7 at 69.4% (not yet on the public tbench.ai board).


Get in touch

Let's connect.

If you're working on something serious in agentic infrastructure — tool design, harness engineering, orchestration loops — drop me a line.