Marco Patzelt

SWE-Bench Verified Leaderboard March 2026

Current AI model rankings and latest top scores across SWE-Bench Verified, SWE-Bench Pro, Terminal-Bench 2.0 & Aider Polyglot — updated March 2026.

| # | Model | Provider | Score | Notes |
|---|-------|----------|-------|-------|
| 1 | Claude Opus 4.5 | Anthropic | 80.9% | |
| 2 | Claude Opus 4.6 | Anthropic | 80.8% | |
| 3 | Gemini 3.1 Pro | Google | 80.6% | |
| 4 | MiniMax M2.5 | MiniMax | 80.2% | Open-weight |
| 5 | GPT-5.2 | OpenAI | 80.0% | |
| 6 | Claude Sonnet 4.6 | Anthropic | 79.6% | |
| 7 | Gemini 3 Flash | Google | 78.0% | |
| 8 | GLM-5 | Zhipu AI | 77.8% | 744B params · Open-source |
| 9 | Claude Sonnet 4.5 | Anthropic | 77.2% | |
| 10 | Kimi K2.5 | Moonshot AI | 76.8% | Open-source |
| 11 | Gemini 3 Pro | Google | 76.2% | |
| 12 | GPT-5.1 | OpenAI | 74.9% | |
| 13 | Grok 4 | xAI | 73.5% | Self-reported 72-75% |
| 14 | Claude Haiku 4.5 | Anthropic | 73.3% | |
| 15 | DeepSeek V3.2 | DeepSeek | 73.0% | Open-source |
| 16 | Claude Sonnet 4 | Anthropic | 72.7% | Scaffold-dependent |
| 17 | Qwen3-Coder-Next | Alibaba | 70.6% | 3B active params · Open |
| 18 | Gemini 2.5 Pro | Google | 63.8% | |
| 19 | GPT-OSS-120B | OpenAI | 62.4% | Open-source |
| 20 | GLM-4.7 | Zhipu AI | 60.0% | Open-source · approx. |
| 21 | Grok Code Fast | xAI | 57.6% | |
| 22 | GPT-4.1 | OpenAI | 54.6% | |
| 23 | o3 | OpenAI | 49.8% | |

Source: swebench.com

Scores are self-reported by model providers unless noted. Scaffold/harness differences affect results.

4 Benchmarks · 72 Model Entries · Updated March 2026

SWE-Bench Verified Leaderboard: Top Models March 2026

The current leader on SWE-Bench Verified in March 2026 is Claude Opus 4.5 at 80.9%, followed by Claude Opus 4.6 at 80.8%, Gemini 3.1 Pro at 80.6%, MiniMax M2.5 at 80.2%, and GPT-5.2 at 80.0%. Claude Sonnet 4.6 scores 79.6% — a mid-tier model nearly matching flagships. The table above shows all 23 models with their latest top scores and rankings.
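
To see what these rankings actually measure, you can pull the benchmark itself. Here is a minimal sketch that loads the SWE-Bench Verified task set with the Hugging Face datasets library; the dataset ID and field names below are the publicly published ones (princeton-nlp/SWE-bench_Verified), but treat them as assumptions to verify against swebench.com.

```python
# Minimal sketch: inspect the SWE-Bench Verified task set.
# Assumes `pip install datasets` and the public dataset ID
# "princeton-nlp/SWE-bench_Verified" (500 human-validated instances).
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(f"{len(ds)} instances")  # expected: 500

# Each instance pairs a real GitHub issue with the gold patch that fixed it.
task = ds[0]
print(task["repo"])                      # source repository, e.g. "astropy/astropy"
print(task["instance_id"])               # unique task ID
print(task["problem_statement"][:300])   # issue text the model must resolve
```

A leaderboard submission is a set of model-generated patches for these instances, which a harness applies and tests; that harness variability is exactly why the scaffold caveat above matters.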

Terminal-Bench 2.0 Leaderboard: Gemini 3.1 Pro Takes #1

Gemini 3.1 Pro now leads Terminal-Bench 2.0 at 78.4%, overtaking GPT-5.3-Codex at 77.3%. Claude Opus 4.6 jumped to 74.7% (up from 65.4% in January). Droid with Opus 4.6 scores 69.9%, and Claude Code reaches 58.0% as a separate scaffold.

SWE-Bench Pro Leaderboard: GPT-5.3-Codex Leads

On SWE-Bench Pro, GPT-5.3-Codex leads at 56.8%, followed by GPT-5.2-Codex at 56.4% and GPT-5.2 at 55.6%. Scores vary dramatically by scaffold — Scale AI's SEAL leaderboard with standardized scaffolding shows Claude Opus 4.5 leading at 45.9%.

Open-Source Models on SWE-Bench 2026

MiniMax M2.5 leads open-weight models on SWE-Bench Verified at 80.2%, ranking #4 overall. GLM-5 from Zhipu AI follows at 77.8% with 744B parameters trained on Huawei chips. Kimi K2.5 from Moonshot AI scores 76.8%. DeepSeek V3.2 reaches 73.0%, and Qwen3-Coder-Next hits 70.6% with only 3B active parameters. Open-source is closing the gap to proprietary frontier models fast.

Best AI Coding Model March 2026

The best model for coding depends on the workflow. Claude Opus 4.5 is the best model on SWE-Bench Verified for Python-heavy repository tasks at 80.9%, with Opus 4.6 at 80.8% and Gemini 3.1 Pro at 80.6%. Claude Sonnet 4.6 is the value pick at 79.6%, only 1.2 points behind Opus 4.6 and five times cheaper. On Aider Polyglot, Grok 4 from xAI scores 79.6%.

For terminal and DevOps workflows, Gemini 3.1 Pro now leads Terminal-Bench 2.0 at 78.4%, overtaking GPT-5.3-Codex at 77.3%. Claude Opus 4.6 jumped to 74.7% on the same benchmark.

Budget-conscious? DeepSeek V3.2-Exp delivers 74.2% on Aider Polyglot at $1.30/run — 22x cheaper than GPT-5. And Qwen3-Coder-Next punches well above its weight with 70.6% on SWE-Bench Verified using only 3B active parameters, making it the most efficient model in the top 20.
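
The cost claims above are easy to sanity-check. Here is a quick back-of-envelope sketch using only the numbers quoted in this article; the implied GPT-5 run cost and Opus pricing are derived from those quotes, not from official price lists.

```python
# Back-of-envelope cost math from the figures quoted in this article.
# Derived values are implications of the article's claims, not official pricing.

deepseek_run = 1.30                 # USD per Aider Polyglot run (quoted above)
gpt5_run = deepseek_run * 22        # "22x cheaper" implies roughly $28.60 per run
print(f"Implied GPT-5 run cost: ${gpt5_run:.2f}")

deepseek_score = 74.2               # DeepSeek V3.2-Exp on Aider Polyglot
print(f"DeepSeek points per dollar: {deepseek_score / deepseek_run:.1f}")

# Sonnet 4.6 is quoted at $3/$15 per million tokens and "five times cheaper
# than Opus", which implies Opus at $15/$75 per million tokens.
sonnet_in, sonnet_out = 3.0, 15.0
opus_in, opus_out = sonnet_in * 5, sonnet_out * 5
print(f"Implied Opus pricing: ${opus_in:.0f} in / ${opus_out:.0f} out per million tokens")
```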

Frequently Asked Questions

Which model has the highest SWE-Bench Verified score?

Claude Opus 4.5 from Anthropic holds the highest self-reported SWE-Bench Verified score at 80.9%, closely followed by Claude Opus 4.6 at 80.8%, Gemini 3.1 Pro at 80.6%, MiniMax M2.5 at 80.2%, and GPT-5.2 at 80.0%. Claude Sonnet 4.6 scores 79.6%, GLM-5 from Zhipu AI reaches 77.8%, Claude Sonnet 4.5 hits 77.2%, and Kimi K2.5 scores 76.8%.

How good is Claude Opus 4.6 at coding?

Claude Opus 4.6 scores 80.8% on SWE-Bench Verified (second to Opus 4.5 at 80.9%) as of March 2026. On Terminal-Bench 2.0, Opus 4.6 reaches 74.7% on its own and 69.9% when used with the Droid agent framework.

How does Gemini 3.1 Pro perform on coding benchmarks?

Gemini 3.1 Pro from Google DeepMind scores 80.6% on SWE-Bench Verified as of February 2026, placing it #3 overall. It also leads Terminal-Bench 2.0 at 78.4% and scores 77.1% on ARC-AGI-2. It was released on February 19, 2026 at the same price as Gemini 3 Pro.

What does Grok 4 score on SWE-Bench Verified?

xAI self-reports 72-75% for Grok 4 on SWE-Bench Verified. Independent testing by vals.ai with the SWE-agent scaffold shows 58.6%, a significant gap that highlights how much scaffold choice affects results. On Aider Polyglot, Grok 4 scores 79.6%.

How does Claude Sonnet 4.6 compare to Opus?

Claude Sonnet 4.6 scores 79.6% on SWE-Bench Verified, only 1.2 points behind Opus 4.6 and 2.4 points ahead of Sonnet 4.5. At $3/$15 per million tokens (five times cheaper than Opus), it offers strong cost-efficiency for coding tasks.

Which model leads SWE-Bench Pro?

GPT-5.3-Codex from OpenAI leads SWE-Bench Pro with a score of 56.8%, closely followed by GPT-5.2-Codex at 56.4% and GPT-5.2 at 55.6%. Note that SWE-Bench Pro scores vary significantly by scaffold: Scale AI's SEAL leaderboard, which uses standardized scaffolding, shows Claude Opus 4.5 leading at 45.9%.

What is the best AI model for coding?

It depends on the use case. Claude Opus 4.5 leads SWE-Bench Verified (80.9%) for Python-heavy tasks, with Gemini 3.1 Pro close behind at 80.6%. Gemini 3.1 Pro leads Terminal-Bench 2.0 (78.4%) for terminal workflows. GPT-5.3-Codex leads SWE-Bench Pro (56.8%). Grok 4 scores 79.6% on Aider Polyglot. DeepSeek V3.2-Exp offers the best cost-efficiency at $1.30 per run.

How fast are SWE-Bench Verified scores improving?

The top score jumped from around 65% in early 2025 to 80.9% in March 2026. Anthropic holds the #1 and #2 spots. Gemini 3.1 Pro surged to 80.6%, placing #3. MiniMax M2.5 reaches 80.2% as an open-weight model. GPT-5.2 sits at 80.0%. Claude Sonnet 4.6 scores 79.6%, a mid-tier model nearly matching flagships. Three Chinese open-source models sit in the top 10. Agent frameworks outperform raw model scores by 10-20 points.

How often is this leaderboard updated?

This leaderboard is updated monthly with the latest benchmark scores from SWE-Bench Verified, Terminal-Bench 2.0, Aider Polyglot, and SWE-Bench Pro. Scores are self-reported by model providers unless noted, and scaffold/harness differences affect results.

Which model should I choose for my workflow?

For pure code generation on Python repos, Claude Opus 4.5 leads at 80.9% on SWE-Bench Verified. For terminal-heavy DevOps workflows, Gemini 3.1 Pro scores 78.4% on Terminal-Bench 2.0. For cost-efficiency, DeepSeek V3.2-Exp delivers 74.2% on Aider Polyglot at just $1.30 per run, 22x cheaper than GPT-5.

What are the best open-source coding models?

MiniMax M2.5 leads open-weight models on SWE-Bench Verified at 80.2%, ranking #4 overall. GLM-5 from Zhipu AI follows at 77.8% with 744B parameters. Kimi K2.5 from Moonshot AI scores 76.8%. DeepSeek V3.2 reaches 73.0%, and Qwen3-Coder-Next hits 70.6% with only 3B active parameters. On Aider Polyglot, DeepSeek V3.2-Exp scores 74.2% at just $1.30 per run.

Which model leads Terminal-Bench 2.0?

Gemini 3.1 Pro leads Terminal-Bench 2.0 at 78.4% as of March 2026, overtaking GPT-5.3-Codex at 77.3%. Claude Opus 4.6 reaches 74.7%. Droid with Claude Opus 4.6 scores 69.9%. GPT-5.2 reaches 62.2%.


Let's connect.

I build middleware by day and autonomous agent systems by night. If you're working on something serious in agentic infrastructure, I'd like to hear about it.

Email me