SWE-Bench Verified Leaderboard April 2026
Current AI model rankings and latest top scores across SWE-Bench Verified, SWE-Bench Pro, Terminal-Bench 2.0 & Aider Polyglot — updated April 2026.
Source: swebench.com
Scores are self-reported by model providers unless noted. Scaffold/harness differences affect results.
SWE-Bench Verified Leaderboard: Claude Opus 4.7 Takes #1
Claude Opus 4.7 from Anthropic now leads SWE-Bench Verified at 87.6% following its April 16, 2026 release with a 1M-token context window. GPT-5.3-Codex follows at 85.0%. Claude Opus 4.5 sits at 80.9%, Opus 4.6 at 80.8%, Gemini 3.1 Pro at 80.6%, MiniMax M2.5 at 80.2%, and GPT-5.2 at 80.0%. Qwen3.6 Plus from Alibaba (April 2026, 78.8%) and Muse Spark from Meta (77.4%) round out the major April entrants.
Terminal-Bench 2.0: ForgeCode Scaffold Tops the Board
ForgeCode with Claude Opus 4.6 and ForgeCode with GPT-5.4 are tied for #1 on Terminal-Bench 2.0 at 81.8%. TongAgents with Gemini 3.1 Pro reaches 80.2%. SageAgent + GPT-5.3-Codex and ForgeCode + Gemini 3.1 Pro both hit 78.4%. Factory.ai's Droid scaffold with GPT-5.3-Codex follows at 77.3%. Anthropic self-reports Claude Opus 4.7 at 69.4%, pending submission to tbench.ai.
SWE-Bench Pro: Claude Opus 4.7 Leads at 64.3%
On SWE-Bench Pro, Claude Opus 4.7 leads at 64.3% (Anthropic-reported, April 2026). GPT-5.4 (xHigh) reaches 59.1% on Scale's SEAL mini-swe-agent scaffold. GPT-5.3-Codex (agent system) scores 56.8%, GPT-5.2-Codex 56.4%, and Muse Spark from Meta 55.0%. Claude Opus 4.6 scores 51.9% on the SEAL mini-swe-agent harness. Scale's fully standardized SEAL board puts Claude Opus 4.5 in the lead at 45.9%.
Open-Source Models on SWE-Bench 2026
MiniMax M2.5 leads open-weight models on SWE-Bench Verified at 80.2%, still in the top 10 overall. MiMo-V2-Pro from Xiaomi reaches 78.0% with 1T parameters. GLM-5 from Zhipu AI follows at 77.8% with 744B parameters trained on Huawei chips. Kimi K2.5 from Moonshot AI scores 76.8%. GLM-4.7 reaches 73.8% (corrected upward from earlier reports). DeepSeek V3.2 hits 73.0%, and Qwen3-Coder-Next achieves 70.6% with only 3B active parameters.
Best AI Coding Model April 2026
Claude Opus 4.7 is the clear overall leader in April 2026 — 87.6% on SWE-Bench Verified and 64.3% on SWE-Bench Pro, both #1. GPT-5.3-Codex follows at 85.0% on SWE-Bench Verified. Claude Sonnet 4.6 punches above its weight at 79.6% — still only 1.2 points behind Opus 4.6 and 5x cheaper.
For terminal and DevOps workflows, ForgeCode scaffolds with Claude Opus 4.6 or GPT-5.4 top Terminal-Bench 2.0 at 81.8%. TongAgents + Gemini 3.1 Pro reaches 80.2%. On multi-language editing (Aider Polyglot), Claude Opus 4.5 leads at 89.4% (Anthropic-reported), with GPT-5 (high) at 88.0%.
Budget-conscious? DeepSeek V3.2-Exp delivers 74.2% on Aider Polyglot at $1.30/run — 22x cheaper than GPT-5. Qwen3-Coder-Next scores 70.6% on SWE-Bench Verified using only 3B active parameters, the most efficient model in the top 25.
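To make the cost-efficiency claim concrete, here is a rough cost-per-solved-task sketch. The DeepSeek figures ($1.30/run, 74.2%) come from the text; GPT-5's per-run cost is only inferred from the "22x cheaper" claim, and its 88.0% pass rate is taken from the Aider Polyglot numbers above, so treat both as illustrative assumptions rather than measured prices.

```python
# Back-of-the-envelope comparison: expected dollars per successfully solved task.
# Run costs and pass rates are the article's figures, not measured prices.

def cost_per_solved(run_cost: float, pass_rate: float) -> float:
    """Expected cost per solved task: price of one run divided by success rate."""
    return run_cost / pass_rate

deepseek = cost_per_solved(1.30, 0.742)      # ~$1.75 per solved task
gpt5 = cost_per_solved(1.30 * 22, 0.880)     # ~$32.50 per solved task (22x run cost assumed)

print(f"DeepSeek: ${deepseek:.2f}/solved, GPT-5: ${gpt5:.2f}/solved, "
      f"ratio {gpt5 / deepseek:.1f}x")
```

Note that the per-solved-task gap (~18.5x) is smaller than the raw 22x run-cost gap, because the cheaper model also solves fewer tasks per run.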
Frequently Asked Questions
Which AI model is best on SWE-Bench Verified in April 2026?
Claude Opus 4.7 from Anthropic leads SWE-Bench Verified at 87.6%, released April 16, 2026. GPT-5.3-Codex from OpenAI follows at 85.0%. Next come Claude Opus 4.5 at 80.9%, Claude Opus 4.6 at 80.8%, Gemini 3.1 Pro at 80.6%, MiniMax M2.5 at 80.2%, and GPT-5.2 at 80.0%. Claude Sonnet 4.6 scores 79.6%. Qwen3.6 Plus reaches 78.8% as Alibaba's new flagship.
How does Claude Opus 4.7 perform on coding benchmarks?
Claude Opus 4.7 scores 87.6% on SWE-Bench Verified, 64.3% on SWE-Bench Pro, and Anthropic reports 69.4% on Terminal-Bench 2.0 (not yet on the public tbench.ai board). Released April 16, 2026 with 1M token context, it now leads all publicly available models on SWE-Bench Verified and SWE-Bench Pro.
How does Claude Opus 4.6 perform on coding benchmarks?
Claude Opus 4.6 scores 80.8% on SWE-Bench Verified and 51.9% on SWE-Bench Pro (Scale SEAL mini-swe-agent). On Terminal-Bench 2.0, Opus 4.6 reaches 74.7% (Terminus-KIRA scaffold), and ForgeCode + Opus 4.6 tops the board at 81.8%. Released January 2026.
How does Gemini 3.1 Pro perform on coding benchmarks?
Gemini 3.1 Pro from Google DeepMind scores 80.6% on SWE-Bench Verified as of February 2026. On Terminal-Bench 2.0, TongAgents + Gemini 3.1 Pro reaches 80.2% and ForgeCode + Gemini 3.1 Pro reaches 78.4%. On SWE-Bench Pro (Scale SEAL mini-swe-agent), it scores 46.1%.
How does Grok 4 from xAI score on SWE-Bench Verified?
xAI self-reports 72-75% for Grok 4 on SWE-Bench Verified. Independent testing by vals.ai with the SWE-agent scaffold shows 58.6% — a significant gap that highlights how scaffold choice affects results. On Aider Polyglot, Grok 4 scores 79.6%. xAI has since released Grok 4.20, now its current flagship.
Is Claude Sonnet 4.6 good value for coding?
Claude Sonnet 4.6 scores 79.6% on SWE-Bench Verified, only 1.2 points behind Opus 4.6 and 2.4 points ahead of Sonnet 4.5. At $3/$15 per million tokens — five times cheaper than Opus — it offers strong cost-efficiency for coding tasks.
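The per-million-token rates can be turned into a per-request estimate. This is a minimal sketch: the Sonnet rate ($3 in / $15 out) comes from the text, the Opus rate is assumed to be exactly 5x ($15/$75) per the "five times cheaper" claim, and the 100k-input / 10k-output token counts are purely illustrative.

```python
# Sketch: single-request cost at per-million-token rates.
# Sonnet rates are from the article; Opus rates are an assumed exact 5x multiple.

def request_cost(in_tokens: int, out_tokens: int,
                 in_rate: float, out_rate: float) -> float:
    """Dollar cost of one request, given $/1M-token input and output rates."""
    return in_tokens / 1e6 * in_rate + out_tokens / 1e6 * out_rate

# Example: a 100k-token codebase prompt with a 10k-token patch response.
sonnet = request_cost(100_000, 10_000, 3, 15)   # ~$0.45
opus = request_cost(100_000, 10_000, 15, 75)    # ~$2.25

print(f"Sonnet: ${sonnet:.2f}, Opus: ${opus:.2f}, ratio {opus / sonnet:.0f}x")
```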
Which model leads SWE-Bench Pro?
Claude Opus 4.7 leads SWE-Bench Pro at 64.3% (Anthropic-reported, April 2026). GPT-5.4 (xHigh) scores 59.1% on Scale SEAL mini-swe-agent. Agent-system scores: GPT-5.3-Codex CLI at 56.8%, GPT-5.2-Codex at 56.4%, GPT-5.2 at 55.6%. Muse Spark from Meta reaches 55.0%. On Scale SEAL standardized scaffolding, Claude Opus 4.5 leads at 45.9%.
What is the best AI coding model right now?
Claude Opus 4.7 leads SWE-Bench Verified (87.6%) and SWE-Bench Pro (64.3%) as of April 2026 — the new overall leader for coding. GPT-5.3-Codex reaches 85.0% on SWE-Bench Verified. ForgeCode scaffolds with Opus 4.6 or GPT-5.4 top Terminal-Bench 2.0 at 81.8%. On Aider Polyglot, Claude Opus 4.5 leads at 89.4% (Anthropic-reported). DeepSeek V3.2-Exp offers the best cost-efficiency at $1.30 per run.
How have SWE-Bench Verified scores changed over time?
The top score jumped from around 65% in early 2025 to 87.6% in April 2026 with Claude Opus 4.7. Anthropic holds the #1 spot; GPT-5.3-Codex at 85.0% is #2. Gemini 3.1 Pro sits at 80.6%, MiniMax M2.5 at 80.2% as an open-weight model, GPT-5.2 at 80.0%, and Claude Sonnet 4.6 at 79.6%. New entrants include Qwen3.6 Plus (78.8%), MiMo-V2-Pro from Xiaomi (78.0%), and Muse Spark from Meta (77.4%). Agent frameworks outperform raw model scores by 5-15 points.
How often is this leaderboard updated?
This leaderboard is updated monthly with the latest benchmark scores from SWE-Bench Verified, Terminal-Bench 2.0, Aider Polyglot, and SWE-Bench Pro. Scores are self-reported by model providers unless noted. Scaffold/harness differences affect results.
Which model is best for each coding task?
For pure code generation, Claude Opus 4.7 leads SWE-Bench Verified at 87.6% and SWE-Bench Pro at 64.3%. For terminal and DevOps workflows, ForgeCode + Opus 4.6 or GPT-5.4 tops Terminal-Bench 2.0 at 81.8%. For multi-language editing, Claude Opus 4.5 leads Aider Polyglot at 89.4%. For cost-efficiency, DeepSeek V3.2-Exp delivers 74.2% on Aider Polyglot at just $1.30 per run.
What are the best open-source coding models?
MiniMax M2.5 leads open-weight models on SWE-Bench Verified at 80.2%, still in the top 10 overall. MiMo-V2-Pro from Xiaomi reaches 78.0% with 1T parameters. GLM-5 from Zhipu AI follows at 77.8% with 744B parameters. Kimi K2.5 scores 76.8%. GLM-4.7 reaches 73.8% (corrected from earlier reports). DeepSeek V3.2 hits 73.0%, and Qwen3-Coder-Next reaches 70.6% with only 3B active parameters.
Who leads Terminal-Bench 2.0?
ForgeCode + Claude Opus 4.6 and ForgeCode + GPT-5.4 are tied at 81.8% on Terminal-Bench 2.0 as of April 2026. TongAgents + Gemini 3.1 Pro reaches 80.2%. SageAgent + GPT-5.3-Codex and ForgeCode + Gemini 3.1 Pro both hit 78.4%. Droid + GPT-5.3-Codex from Factory scores 77.3%. Anthropic reports Claude Opus 4.7 at 69.4% (not yet on the public tbench.ai board).
Let's connect.
I build middleware by day and autonomous agent systems by night. If you're working on something serious in agentic infrastructure, I'd like to hear about it.