MiniMax just dropped M2.5, and the numbers are hard to ignore: 80.2% on SWE-Bench Verified, on par with Claude Opus 4.5, at 1/10th to 1/20th the price.
This isn't a minor update. Three versions in 108 days—M2, M2.1, now M2.5—and the improvement curve is steeper than any other model family right now, closed or open. Let's break down what's actually new, what the benchmarks mean, and whether you should care.
What's New in M2.5
M2.5 keeps the same MoE architecture as its predecessors: 230 billion total parameters, 10 billion active per token. The efficiency play hasn't changed. What changed is what they did with reinforcement learning.
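Before getting to the RL story, a quick, rough sketch of what that 230B-total / 10B-active split means in practice. This uses the standard ~2-FLOPs-per-active-parameter rule of thumb; nothing here comes from MiniMax beyond the parameter counts.

```python
# Back-of-the-envelope math on the MoE split: per-token compute scales with
# *active* parameters, while memory to hold the weights scales with *total*.
TOTAL_PARAMS = 230e9   # all experts combined
ACTIVE_PARAMS = 10e9   # routed per token

print(f"Active fraction per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")  # ~4.3%

# Rough rule of thumb: a decoder forward pass costs ~2 FLOPs per active parameter,
# so per-token compute lands in the ballpark of a ~10B dense model.
print(f"~{2 * ACTIVE_PARAMS / 1e9:.0f} GFLOPs per generated token")      # ~20
```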
MiniMax trained M2.5 across 200,000+ real-world environments in 10+ programming languages—Go, C, C++, TypeScript, Rust, Kotlin, Python, Java, JavaScript, PHP, Lua, Dart, Ruby. Not just Python benchmarks. Real polyglot workflows.
The architect behavior: M2.5 developed what MiniMax calls "spec-writing tendency" during training. Before writing code, it decomposes the project into features, structure, and UI design like an experienced architect would. This isn't a prompt trick—it emerged from the RL training process.
Two versions, same capability:
| Version | Speed | Input Price | Output Price |
|---|---|---|---|
| M2.5-Lightning | 100 TPS | $0.30/1M | $2.40/1M |
| M2.5 (Standard) | 50 TPS | $0.15/1M | $1.20/1M |
For context: Claude Opus 4.6 runs at roughly 50 TPS. M2.5-Lightning is 2x that speed at a fraction of the cost.
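To translate those per-million-token prices into something you can budget against, a small helper like the one below does the arithmetic. The token counts in the example are illustrative, not measured numbers.

```python
# Cost estimator built from the price table above (USD per 1M tokens).
PRICES = {
    "m2.5-lightning": {"input": 0.30, "output": 2.40},
    "m2.5-standard":  {"input": 0.15, "output": 1.20},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one call at the listed prices."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Illustrative agent step: 40k tokens of context in, 2k tokens out.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 40_000, 2_000):.4f} per step")
```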
The Benchmarks: What Actually Matters
Let's look at the numbers that matter for working developers, not abstract reasoning tests.
Coding:
| Benchmark | M2.5 | M2.1 | Opus 4.5 | Opus 4.6 |
|---|---|---|---|---|
| SWE-Bench Verified | 80.2% | 74.0% | ~78% | ~79% |
| Multi-SWE-Bench | 51.3% | 49.4% | — | — |
| SWE-Bench Pro | 55.4% | — | — | — |
| VIBE-Pro (full-stack) | ~Opus 4.5 level | 88.6 (VIBE) | Baseline | — |
80.2% on SWE-Bench Verified. That's not "competitive with"—that's beating most frontier models on the single most important coding benchmark. I did a full breakdown of how coding models compare on SWE-Bench recently—M2.5 would sit at the very top of that list.
Multi-SWE-Bench, which measures software engineering across multiple programming languages in real codebases, comes in at 51.3%, the #1 result.
Agentic & Search:
| Benchmark | M2.5 | Notes |
|---|---|---|
| BrowseComp (w/ context mgmt) | 76.3% | Industry-leading |
| BFCL (tool calling) | 76.8% | SOTA |
| RISE (expert search) | Leading | New internal benchmark |
Scaffold Generalization: This is where it gets interesting. M2.5 was tested across different coding agent harnesses, not just its own:
- Droid: 79.7% (vs M2.1's 71.3%, Opus 4.6's 78.9%)
- OpenCode: 76.1% (vs M2.1's 72.0%, Opus 4.6's 75.9%)
Beating Opus 4.6 on unfamiliar scaffolding. That's not benchmark gaming—that's genuine generalization.
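The practical reason one model can be dropped into so many harnesses is that most of them speak the OpenAI-style chat and tool-calling API. Here's a minimal sketch of calling M2.5 through OpenRouter with the standard openai client; the model slug is an assumption based on how earlier M2 releases were listed, so check the catalog before relying on it.

```python
# Minimal sketch: calling M2.5 through an OpenAI-compatible endpoint (OpenRouter).
# The model slug below is assumed from how earlier M2 releases were listed; verify it.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

resp = client.chat.completions.create(
    model="minimax/minimax-m2.5",  # hypothetical slug; check the provider catalog
    messages=[
        {"role": "system", "content": "You are a careful coding agent."},
        {"role": "user", "content": "Write a failing pytest that reproduces an off-by-one bug in pagination."},
    ],
)
print(resp.choices[0].message.content)
```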
The Speed Story
Raw benchmark scores are one thing. How fast you get there matters for agentic workflows.
M2.5 completes SWE-Bench Verified tasks in an average of 22.8 minutes, down from M2.1's 31.3 minutes: roughly 27% less wall-clock time per task, or about 37% higher throughput. For reference, Claude Opus 4.6 averages 22.9 minutes.
Same speed as Opus. At a fraction of the cost.
Token efficiency also improved: 3.52M tokens per SWE-Bench task vs M2.1's 3.72M. And across BrowseComp, Wide Search, and RISE, M2.5 uses roughly 20% fewer rounds to reach answers. The model isn't just faster—it's smarter about how it searches.
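A quick sanity check on those deltas (the roughly-20%-fewer-rounds figure is MiniMax's own and isn't re-derived here):

```python
# Sanity-check the speed and token-efficiency numbers quoted above.
m21_minutes, m25_minutes = 31.3, 22.8
m21_tokens,  m25_tokens  = 3.72e6, 3.52e6

print(f"Wall-clock reduction: {(m21_minutes - m25_minutes) / m21_minutes:.0%}")  # ~27%
print(f"Throughput gain:      {m21_minutes / m25_minutes - 1:.0%}")              # ~37%
print(f"Token saving:         {(m21_tokens - m25_tokens) / m21_tokens:.0%}")     # ~5%
```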
The Cost Math
Here's where M2.5 gets genuinely interesting for anyone running agents at scale.
M2.5-Lightning (100 TPS): $1/hour of continuous operation. M2.5 Standard (50 TPS): $0.30/hour of continuous operation.
MiniMax's own claim: $10,000 runs 4 agents continuously for an entire year. Let's verify:
4 agents × 365 days × 24 hours = 35,040 agent-hours. $10,000 ÷ 35,040 = $0.285/hour. ✓ Checks out at the 50 TPS tier.
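The same check in code, deriving the per-hour figures from the table's output prices. This assumes the agent streams output nonstop at the rated speed and ignores input-token cost, which is why the quoted $1 and $0.30 per hour sit a bit above these output-only floors.

```python
# Re-derive "$/hour of continuous operation" from the price table (output tokens only).
def output_cost_per_hour(tps: float, output_price_per_million: float) -> float:
    return tps * 3600 / 1e6 * output_price_per_million

print(f"Lightning: ${output_cost_per_hour(100, 2.40):.2f}/hour")  # ~$0.86
print(f"Standard:  ${output_cost_per_hour(50, 1.20):.2f}/hour")   # ~$0.22

# MiniMax's "four agents for a year on $10,000" claim at the 50 TPS tier:
agent_hours = 4 * 365 * 24                                    # 35,040 agent-hours
print(f"Budget: ${10_000 / agent_hours:.3f} per agent-hour")  # ~$0.285
```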
Compare that to Opus 4.6 or GPT-5, where the output token cost alone is 10-20x higher per million tokens. For teams running nightly CI agents, multi-repo code review, or continuous research loops, the economics are fundamentally different.
The Office Angle
M2.5 isn't just a coding model anymore. MiniMax trained it with senior professionals from finance, law, and social sciences to produce what they call "truly deliverable outputs" in office scenarios—Word docs, PowerPoint decks, Excel financial modeling.
On their internal Cowork Agent evaluation (GDPval-MM), M2.5 achieves a 59.0% average win rate against mainstream models in professional document generation. Not groundbreaking, but notable that a coding-first model is now competitive in office productivity.
MiniMax claims 30% of their own company's daily tasks are now completed by M2.5 agents, and 80% of newly committed code is M2.5-generated. If accurate, that's a company eating its own dogfood at serious scale.
The Reality Check
Let's separate hype from substance.
What's verified:
- Key coding results have independent support: Graham Neubig (CMU, one of the most respected NLP researchers) tested the model himself and confirmed it outperforms recent Claude Sonnet releases on coding tasks.
- The M2 series architecture (230B/10B MoE) is well-documented and open-source for previous versions.
- Pricing is live on OpenRouter and MiniMax's own platform. Already available on multiple providers.
- Previous M2 and M2.1 versions delivered on their benchmark claims in community testing.
What's unclear:
- M2.5 open-source status isn't explicitly confirmed yet. M2 and M2.1 were both fully open-sourced on HuggingFace. M2.5 weights may follow, but right now it's API-only.
- Office productivity claims are based on internal benchmarks (GDPval-MM) that aren't publicly available. Take the 59% win rate with appropriate skepticism.
- "Spec-writing" behavior sounds compelling but needs real-world validation beyond MiniMax's own examples.
- Verbosity was a known issue with M2. MiniMax claims M2.5 is more token-efficient, but independent confirmation is still early.
What's genuinely impressive: Three model versions in 108 days, each with meaningful improvements. M2 → M2.1 → M2.5 shows the fastest improvement curve in the industry on SWE-Bench Verified. That's not hype—that's velocity.
Who Should Care
Yes, switch to M2.5 if:
- You're running agentic coding workflows and cost matters (it always matters)
- You use Claude Code, Cline, OpenCode, Droid, or similar scaffolding—M2.5 generalizes across all of them
- You need multilingual coding (Go, Rust, C++, not just Python)
- You're building always-on agents where $1/hour vs $15/hour changes the business case entirely
Stay with Opus/GPT-5 if:
- You need peak general intelligence beyond coding (HLE scores still favor frontier closed models)
- Your workflow depends on specific Anthropic/OpenAI ecosystem features
- You're doing complex creative writing or nuanced reasoning tasks where the gap still exists
Wait and see if:
- You want to self-host. If M2.5 follows M2/M2.1's pattern, open weights will land on HuggingFace soon. Same 230B/10B architecture means the hardware requirements haven't changed.
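To put "the hardware requirements haven't changed" in rough numbers, here's a ballpark weight-memory estimate under common quantization assumptions; these are napkin figures, not tested configurations.

```python
# Rough weight-memory estimate for self-hosting a 230B-total-parameter MoE.
# Ignores KV cache, activations, and framework overhead; ballpark only.
TOTAL_PARAMS = 230e9

for precision, bytes_per_param in [("FP16/BF16", 2.0), ("FP8/INT8", 1.0), ("~4-bit", 0.5)]:
    print(f"{precision:>10}: ~{TOTAL_PARAMS * bytes_per_param / 1e9:.0f} GB of weights")

# All experts must stay resident even though only ~10B params are active per token,
# so memory tracks the 230B total while per-token compute tracks the 10B active.
```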
Qwen3 Coder was the previous open-source leader on SWE-Bench—M2.5 just passed it by over 10 points.
The Verdict
MiniMax M2.5 is the real deal for coding and agentic workflows. 80.2% SWE-Bench Verified, Opus-level speed, at 1/10th the price. Open-source lineage. Fastest improvement curve in the industry.
For most dev teams running AI-assisted coding, M2.5-Lightning is now the default recommendation. Same output quality as Opus, 2x the speed, at a price that makes continuous agent operation a non-decision.
The open-source AI race just got another serious contender. And this one ships fast.