Marco Patzelt

SWE-Bench Verified Leaderboard February 2026

Current AI model rankings and latest top scores across SWE-Bench Verified, SWE-Bench Pro, Terminal-Bench 2.0 & Aider Polyglot — updated February 2026.

 # | Model             | Provider    | Score | Notes
 1 | Claude Opus 4.5   | Anthropic   | 80.9% |
 2 | Claude Opus 4.6   | Anthropic   | 80.8% |
 3 | GPT-5.2           | OpenAI      | 80.0% |
 4 | Gemini 3 Flash    | Google      | 78.0% |
 5 | Claude Sonnet 4.5 | Anthropic   | 77.2% |
 6 | Gemini 3 Pro      | Google      | 76.2% |
 7 | GPT-5.1           | OpenAI      | 74.9% |
 8 | Grok 4            | xAI         | 73.5% | Self-reported 72-75%
 9 | Claude Haiku 4.5  | Anthropic   | 73.3% |
10 | DeepSeek V3.2     | DeepSeek    | 73.0% | Open-source
11 | Claude Sonnet 4   | Anthropic   | 72.7% | Scaffold-dependent*
12 | Qwen3-Coder-Next  | Alibaba     | 70.6% | 3B active params · Open-source
13 | Kimi K2           | Moonshot AI | 65.8% | Open-source
14 | Gemini 2.5 Pro    | Google      | 63.8% |
15 | GPT-OSS-120B      | OpenAI      | 62.4% | Open-source
16 | GLM-4.7           | Zhipu AI    | 60.0% | Open-source · approx.
17 | Grok Code Fast    | xAI         | 57.6% |
18 | GPT-4.1           | OpenAI      | 54.6% |
19 | MiniMax M2.1      | MiniMax     | 52.0% | Open-weight · approx.
20 | o3                | OpenAI      | 49.8% |

Source: swebench.com

Scores are self-reported by model providers unless noted. Scaffold/harness differences affect results.

4 Benchmarks · 64 Model Entries · Updated February 2026

SWE-Bench Verified Leaderboard: Top Models February 2026

The current leader on SWE-Bench Verified in February 2026 is Claude Opus 4.5 at 80.9%, followed by Claude Opus 4.6 at 80.8% and GPT-5.2 at 80.0%. Grok 4 from xAI self-reports 72-75%. The table above shows all 20 models with their latest top scores and rankings.
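If you want to work with these numbers yourself, here is a minimal Python sketch. The data structure and field names are my own illustration, not an official SWE-Bench API; it turns the top of the table above into sortable records and converts each percentage back to an approximate count of resolved tasks, since SWE-Bench Verified contains 500 human-validated instances.

```python
# Minimal sketch: the top of the leaderboard above as plain Python data.
# Names and structure are illustrative only, not an official SWE-Bench API.
from dataclasses import dataclass

@dataclass
class Entry:
    rank: int
    model: str
    provider: str
    score: float  # SWE-Bench Verified, % of instances resolved

LEADERBOARD = [
    Entry(1, "Claude Opus 4.5", "Anthropic", 80.9),
    Entry(2, "Claude Opus 4.6", "Anthropic", 80.8),
    Entry(3, "GPT-5.2", "OpenAI", 80.0),
    Entry(4, "Gemini 3 Flash", "Google", 78.0),
    # ... remaining rows from the table above
]

# SWE-Bench Verified has 500 human-validated instances, so a percentage
# maps back to an approximate count of resolved tasks.
TOTAL_INSTANCES = 500
for e in sorted(LEADERBOARD, key=lambda entry: entry.score, reverse=True):
    resolved = round(e.score / 100 * TOTAL_INSTANCES)
    print(f"{e.rank:>2}. {e.model:<20} {e.provider:<10} "
          f"{e.score:.1f}%  (~{resolved}/{TOTAL_INSTANCES} resolved)")
```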

SWE-Bench Pro Leaderboard: GPT-5.3-Codex Leads

On SWE-Bench Pro, GPT-5.3-Codex leads at 56.8%, followed by GPT-5.2-Codex at 56.4% and GPT-5.2 at 55.6%. Scores vary dramatically by scaffold — Scale AI's initial SWE-Agent results showed ~23% for top models, while newer scaffolds push scores to 45%+.
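To make that scaffold effect concrete, here is a tiny sketch; the labels and the ~23% / 45% figures are just the rough numbers quoted above, not official results for any single model.

```python
# Rough illustration of how much scaffold choice moves SWE-Bench Pro scores.
# The figures are the approximate numbers quoted in the text above, not
# official results for any single model.
swe_bench_pro = {
    "Scale AI SWE-Agent (initial runs)": 23.0,  # "~23% for top models"
    "newer agentic scaffolds":           45.0,  # "45%+"
    "best reported (GPT-5.3-Codex)":     56.8,
}

baseline = swe_bench_pro["Scale AI SWE-Agent (initial runs)"]
for setup, score in swe_bench_pro.items():
    delta = score - baseline
    print(f"{setup:<34} {score:5.1f}%  ({delta:+.1f} pts vs. initial SWE-Agent)")
```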

Terminal-Bench 2.0 Leaderboard: Top Scores February 2026

Codex CLI (GPT-5) leads Terminal-Bench 2.0 at 77.3%, followed by GPT-5.3-Codex at 75.1%. Droid with Claude Opus 4.6 scores 69.9%. Anthropic self-reports 65.4% for Opus 4.6 and 59.8% for Opus 4.5. Claude Code reaches 58.0% as a separate scaffold.

Open-Source Models on SWE-Bench 2026

DeepSeek V3.2 leads open-source models on SWE-Bench Verified at 73.0%. Qwen3-Coder-Next follows at 70.6% with only 3B active parameters. Kimi K2 from Moonshot AI scores 65.8%, OpenAI's GPT-OSS-120B hits 62.4%, and GLM-4.7 from Zhipu AI reaches ~60%. Open-source models are closing the gap to proprietary frontier models fast.

Best AI Coding Model February 2026

The best coding model depends on the workflow. Claude Opus 4.5 leads SWE-Bench Verified for Python-heavy repository tasks at 80.9%, with Opus 4.6 at 80.8% and GPT-5.2 at 80.0%. Grok 4 from xAI scores 79.6% on Aider Polyglot.

For terminal and DevOps workflows, GPT-5.3-Codex is the strongest raw model on Terminal-Bench 2.0 at 75.1%, while Codex CLI (GPT-5) pushes the top score to 77.3% with agent-level scaffolding. Droid with Opus 4.6 reaches 69.9% on the same benchmark, useful if you need to stay within the Anthropic ecosystem.

Budget-conscious? DeepSeek V3.2-Exp delivers 74.2% on Aider Polyglot at $1.30 per run, 22x cheaper than GPT-5. And Qwen3-Coder-Next punches well above its weight with 70.6% on SWE-Bench Verified using only 3B active parameters, making it by far the most parameter-efficient model near the top of the leaderboard.
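For context on the cost claim, here is the arithmetic behind "22x cheaper"; the GPT-5 run cost below is back-calculated from that ratio rather than taken from an official price list.

```python
# Back-of-the-envelope cost comparison for one Aider Polyglot run.
# DeepSeek's $1.30 and the 22x ratio come from the text; the GPT-5 cost
# is implied by that ratio, not taken from an official price list.
deepseek_cost = 1.30          # USD per Aider Polyglot run (DeepSeek V3.2-Exp)
cost_ratio = 22               # "22x cheaper than GPT-5"
implied_gpt5_cost = deepseek_cost * cost_ratio   # ~28.60 USD per run

deepseek_score = 74.2         # Aider Polyglot pass rate, %
print(f"Implied GPT-5 run cost: ~${implied_gpt5_cost:.2f}")
print(f"DeepSeek V3.2-Exp: ${deepseek_cost:.2f} per run, "
      f"${deepseek_cost / deepseek_score:.3f} per score point")
```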

Frequently Asked Questions

Which AI model has the highest SWE-Bench Verified score?

Claude Opus 4.5 from Anthropic holds the highest self-reported SWE-Bench Verified score at 80.9%, closely followed by Claude Opus 4.6 at 80.8% and GPT-5.2 at 80.0%. Gemini 3 Flash scores 78.0%, Claude Sonnet 4.5 reaches 77.2%, and Gemini 3 Pro scores 76.2%. Grok 4 from xAI self-reports 72-75%.

How does Claude Opus 4.6 perform on coding benchmarks?

Claude Opus 4.6 scores 80.8% on SWE-Bench Verified (second to Opus 4.5 at 80.9%) as of February 2026. Anthropic self-reports 65.4% on Terminal-Bench 2.0, and it reaches 69.9% when used with the Droid agent framework.

How does Grok 4 perform on SWE-Bench Verified?

xAI self-reports 72-75% for Grok 4 on SWE-Bench Verified. Independent testing by vals.ai with the SWE-agent scaffold shows 58.6%, a significant gap that highlights how scaffold choice affects results. On Aider Polyglot, Grok 4 scores 79.6%.

How does Qwen3-Coder-Next perform on coding benchmarks?

Qwen3-Coder-Next from Alibaba scores 70.6% on SWE-Bench Verified with only 3 billion active parameters. It also achieves 44.3% on SWE-Bench Pro and 36.2% on Terminal-Bench 2.0.

Which model leads SWE-Bench Pro?

GPT-5.3-Codex from OpenAI leads SWE-Bench Pro with a score of 56.8%, closely followed by GPT-5.2-Codex at 56.4% and GPT-5.2 at 55.6%. Note that SWE-Bench Pro scores vary significantly by scaffold: Scale AI's initial SWE-Agent results showed ~23% for top models, while newer scaffolds push scores to 45%+.

What is the best AI coding model overall?

It depends on the use case. Claude Opus 4.5 leads SWE-Bench Verified (80.9%) for Python-heavy tasks. GPT-5.3-Codex leads SWE-Bench Pro (56.8%) and scores 75.1% on Terminal-Bench 2.0 for agentic workflows. Grok 4 scores 79.6% on Aider Polyglot. DeepSeek V3.2-Exp offers the best cost-efficiency at $1.30 per run.

How have SWE-Bench Verified scores changed over the past year?

The top score jumped from around 65% in early 2025 to 80.9% in February 2026. Anthropic holds the #1 and #2 spots. GPT-5.2 surged to 80.0%. xAI entered with Grok 4 (72-75%). Alibaba's Qwen3-Coder-Next (70.6% with only 3B active params) and Moonshot AI's Kimi K2 (65.8%) joined the top 15. Agent frameworks now outperform raw model scores by 10-20 points.

How often is this leaderboard updated?

This leaderboard is updated monthly with the latest benchmark scores from SWE-Bench Verified, Terminal-Bench 2.0, Aider Polyglot, and SWE-Bench Pro. Scores are self-reported by model providers unless noted. Scaffold/harness differences affect results.

Which model should I pick for my coding workflow?

For pure code generation on Python repos, Claude Opus 4.5 leads at 80.9% on SWE-Bench Verified. For terminal-heavy DevOps workflows, GPT-5.3-Codex scores 75.1% on Terminal-Bench 2.0. For cost-efficiency, DeepSeek V3.2-Exp delivers 74.2% on Aider Polyglot at just $1.30 per run, 22x cheaper than GPT-5.

What is the best open-source coding model?

DeepSeek V3.2 leads open-source models on SWE-Bench Verified at 73.0%. Qwen3-Coder-Next follows at 70.6% with only 3B active parameters. Kimi K2 from Moonshot AI scores 65.8%, and GLM-4.7 from Zhipu AI reaches ~60%. On Aider Polyglot, DeepSeek V3.2-Exp scores 74.2% at just $1.30 per run.

Which model leads Terminal-Bench 2.0?

Codex CLI (GPT-5), an agent-optimized setup, leads Terminal-Bench 2.0 at 77.3%. GPT-5.3-Codex scores 75.1%. Droid with Claude Opus 4.6 reaches 69.9%. Anthropic self-reports 65.4% for Opus 4.6 and 59.8% for Opus 4.5. GPT-5.2 reaches 62.2%.

Read the full analysis →
