Marco Patzelt
February 12, 2026

MiniMax M2.5: Opus-Level Coding for $1/Hour—Here's What's Real

MiniMax M2.5 hits 80.2% on SWE-Bench Verified at 1/10th the cost of Opus, runs at 100 TPS, and comes from an open-weight lineage. Full benchmark breakdown, pricing, and what it means for devs.

MiniMax just dropped M2.5, and the numbers are hard to ignore: 80.2% on SWE-Bench Verified, on par with Claude Opus 4.5, at 1/10th to 1/20th the price.

This isn't a minor update. Three versions in 108 days—M2, M2.1, now M2.5—and the improvement curve is steeper than any other model family right now, closed or open. Let's break down what's actually new, what the benchmarks mean, and whether you should care.

What's New in M2.5

M2.5 keeps the same MoE architecture as its predecessors: 230 billion total parameters, 10 billion active per token. The efficiency play hasn't changed. What changed is what they did with reinforcement learning.

MiniMax trained M2.5 across 200,000+ real-world environments in 10+ programming languages—Go, C, C++, TypeScript, Rust, Kotlin, Python, Java, JavaScript, PHP, Lua, Dart, Ruby. Not just Python benchmarks. Real polyglot workflows.

The architect behavior: M2.5 developed what MiniMax calls "spec-writing tendency" during training. Before writing code, it decomposes the project into features, structure, and UI design like an experienced architect would. This isn't a prompt trick—it emerged from the RL training process.

Two versions, same capability:

Version         | Speed   | Input Price | Output Price
M2.5-Lightning  | 100 TPS | $0.30/1M    | $2.40/1M
M2.5 (Standard) | 50 TPS  | $0.15/1M    | $1.20/1M

For context: Claude Opus 4.6 runs at roughly 50 TPS. M2.5-Lightning is 2x that speed at a fraction of the cost.
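
To see where the per-hour figures later in this piece come from, here's a quick back-of-the-envelope sketch. It assumes continuous generation at the rated speed and counts only output tokens; real agent loops also pay for input tokens, which is why the article's rounded figures land a bit higher.

```python
# Rough $/hour for a model generating output continuously at its rated speed.
# Counts only output tokens; real agent loops also consume input tokens.
def hourly_cost(tps: float, output_price_per_million_usd: float) -> float:
    tokens_per_hour = tps * 3600
    return tokens_per_hour / 1_000_000 * output_price_per_million_usd

print(f"M2.5-Lightning: ${hourly_cost(100, 2.40):.2f}/h")  # ~$0.86/h -> 'about $1/hour'
print(f"M2.5 Standard:  ${hourly_cost(50, 1.20):.2f}/h")   # ~$0.22/h -> 'about $0.30/hour'
```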

The Benchmarks: What Actually Matters

Let's look at the numbers that matter for working developers, not abstract reasoning tests.

Coding:

Benchmark             | M2.5            | M2.1        | Opus 4.5 | Opus 4.6
SWE-Bench Verified    | 80.2%           | 74.0%       | ~78%     | ~79%
Multi-SWE-Bench       | 51.3%           | 49.4%       | –        | –
SWE-Bench Pro         | 55.4%           | –           | –        | –
VIBE-Pro (full-stack) | ~Opus 4.5 level | 88.6 (VIBE) | Baseline | –

80.2% on SWE-Bench Verified. That's not "competitive with"—that's beating most frontier models on the single most important coding benchmark. I did a full breakdown of how coding models compare on SWE-Bench recently—M2.5 would sit at the very top of that list.

Multi-SWE-Bench, which measures multilingual software engineering across real codebases, puts M2.5 in first place at 51.3%.

Agentic & Search:

Benchmark                    | M2.5    | Notes
BrowseComp (w/ context mgmt) | 76.3%   | Industry-leading
BFCL (tool calling)          | 76.8%   | SOTA
RISE (expert search)         | Leading | New internal benchmark

Scaffold Generalization: This is where it gets interesting. M2.5 was tested across different coding agent harnesses, not just its own:

  • Droid: 79.7% (vs M2.1's 71.3%, Opus 4.6's 78.9%)
  • OpenCode: 76.1% (vs M2.1's 72.0%, Opus 4.6's 75.9%)

Beating Opus 4.6 on unfamiliar scaffolding. That's not benchmark gaming—that's genuine generalization.

The Speed Story

Raw benchmark scores are one thing. How fast you get there matters for agentic workflows.

M2.5 completes SWE-Bench Verified tasks in an average of 22.8 minutes, down from M2.1's 31.3 minutes: tasks finish about 27% sooner, roughly a 1.37x throughput gain. For reference, Claude Opus 4.6 averages 22.9 minutes.

Same speed as Opus. At a fraction of the cost.

Token efficiency also improved: 3.52M tokens per SWE-Bench task vs M2.1's 3.72M. And across BrowseComp, Wide Search, and RISE, M2.5 uses roughly 20% fewer rounds to reach answers. The model isn't just faster—it's smarter about how it searches.

The Cost Math

Here's where M2.5 gets genuinely interesting for anyone running agents at scale.

M2.5-Lightning (100 TPS): $1/hour of continuous operation. M2.5 Standard (50 TPS): $0.30/hour of continuous operation.

MiniMax's own claim: $10,000 runs 4 agents continuously for an entire year. Let's verify:

4 agents × 365 days × 24 hours = 35,040 agent-hours. $10,000 ÷ 35,040 = $0.285/hour. ✓ Checks out at the 50 TPS tier.

Compare that to Opus 4.6 or GPT-5, where the output token cost alone is 10-20x higher per million tokens. For teams running nightly CI agents, multi-repo code review, or continuous research loops, the economics are fundamentally different.
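
For the skeptical, here's the same check as code. It restates the article's own arithmetic rather than adding new data, plus what the identical budget buys at the Lightning tier's roughly $1/hour rate.

```python
# Sanity check of "4 agents, 24/7, for a year on $10,000".
budget_usd = 10_000
agents = 4
hours_per_year = 365 * 24                  # 8,760

agent_hours = agents * hours_per_year      # 35,040 agent-hours
implied_rate = budget_usd / agent_hours    # USD per agent-hour
print(f"{agent_hours:,} agent-hours -> ${implied_rate:.3f}/hour")
# -> 35,040 agent-hours -> $0.285/hour, consistent with the ~$0.30/hour Standard tier

# The same budget at the Lightning tier's ~$1/hour buys ~10,000 agent-hours,
# i.e. a bit over one agent running 24/7 for a year.
print(f"Lightning: {budget_usd / 1.0 / hours_per_year:.2f} agent-years")
```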

The Office Angle

M2.5 isn't just a coding model anymore. MiniMax trained it with senior professionals from finance, law, and social sciences to produce what they call "truly deliverable outputs" in office scenarios—Word docs, PowerPoint decks, Excel financial modeling.

On their internal Cowork Agent evaluation (GDPval-MM), M2.5 achieves a 59.0% average win rate against mainstream models in professional document generation. Not groundbreaking, but notable that a coding-first model is now competitive in office productivity.

MiniMax claims 30% of their own company's daily tasks are now completed by M2.5 agents, and 80% of newly committed code is M2.5-generated. If accurate, that's a company eating its own dogfood at serious scale.

The Reality Check

Let's separate hype from substance.

What's verified:

  • Independent confirmation: Graham Neubig (CMU, one of the most respected NLP researchers) tested M2.5 and confirmed it outperforms recent Claude Sonnet models on coding tasks.
  • The M2 series architecture (230B/10B MoE) is well-documented and open-source for previous versions.
  • Pricing is live on OpenRouter and MiniMax's own platform, and the model is already available through multiple providers.
  • Previous M2 and M2.1 versions delivered on their benchmark claims in community testing.

What's unclear:

  • M2.5 open-source status isn't explicitly confirmed yet. M2 and M2.1 were both fully open-sourced on HuggingFace. M2.5 weights may follow, but right now it's API-only.
  • Office productivity claims are based on internal benchmarks (GDPval-MM) that aren't publicly available. Take the 59% win rate with appropriate skepticism.
  • "Spec-writing" behavior sounds compelling but needs real-world validation beyond MiniMax's own examples.
  • Verbosity was a known issue with M2. MiniMax claims M2.5 is more token-efficient, but independent confirmation is still early.

What's genuinely impressive: Three model versions in 108 days, each with meaningful improvements. M2 → M2.1 → M2.5 shows the fastest improvement curve in the industry on SWE-Bench Verified. That's not hype—that's velocity.

Who Should Care

Yes, switch to M2.5 if:

  • You're running agentic coding workflows and cost matters (it always matters)
  • You use Claude Code, Cline, OpenCode, Droid, or similar scaffolding—M2.5 generalizes across all of them
  • You need multilingual coding (Go, Rust, C++, not just Python)
  • You're building always-on agents where $1/hour vs $15/hour changes the business case entirely

Stay with Opus/GPT-5 if:

  • You need peak general intelligence beyond coding (HLE scores still favor frontier closed models)
  • Your workflow depends on specific Anthropic/OpenAI ecosystem features
  • You're doing complex creative writing or nuanced reasoning tasks where the gap still exists

Wait and see if:

  • You want to self-host. If M2.5 follows M2/M2.1's pattern, open weights will land on HuggingFace soon. Same 230B/10B architecture means the hardware requirements haven't changed.

Qwen3 Coder was the previous open-source leader on SWE-Bench—M2.5 just passed it by over 10 points.

The Verdict

MiniMax M2.5 is the real deal for coding and agentic workflows. 80.2% SWE-Bench Verified, Opus-level speed, at 1/10th the price. Open-source lineage. Fastest improvement curve in the industry.

For most dev teams running AI-assisted coding, M2.5-Lightning is now the default recommendation. Same output quality as Opus, 2x the speed, at a price that makes continuous agent operation a non-decision.

The open-source AI race just got another serious contender. And this one ships fast.

Frequently Asked Questions

What is MiniMax M2.5?
MiniMax M2.5 is an LLM for coding and agentic workflows, the latest in MiniMax's M2 family, whose earlier versions were open-sourced. It uses a Mixture of Experts architecture with 230B total parameters and 10B active per token, scoring 80.2% on SWE-Bench Verified at 1/10th to 1/20th the cost of frontier closed models.

How much does M2.5 cost?
M2.5-Lightning (100 TPS) costs $0.30 per million input tokens and $2.40 per million output tokens, roughly $1/hour of continuous operation. The standard 50 TPS version costs half as much per token, which works out to roughly $0.30/hour.

How does M2.5 compare to Claude Opus 4.6?
M2.5 matches or exceeds Opus 4.6 on coding benchmarks: 80.2% on SWE-Bench Verified vs ~79%. It beats Opus 4.6 on scaffold generalization (Droid: 79.7% vs 78.9%), and speed is essentially identical at ~22.8 minutes per task. The key difference: M2.5 is 10-20x cheaper.

Is M2.5 open source?
Previous M2-series models were open-sourced on HuggingFace under MIT and Modified-MIT licenses. M2.5 launched as API-only; open weights are expected based on MiniMax's track record but haven't been confirmed yet.

Which programming languages does M2.5 support?
M2.5 was trained on 10+ languages, including Go, C, C++, TypeScript, Rust, Kotlin, Python, Java, JavaScript, PHP, Lua, Dart, and Ruby, across 200,000+ environments. It ranks #1 on Multi-SWE-Bench (51.3%).

How does M2.5 compare to DeepSeek?
On SWE-Bench Verified, M2.5's 80.2% is ahead of DeepSeek V3.2. M2.5 also leads in agentic benchmarks like BrowseComp (76.3%) and tool calling (BFCL 76.8%). For coding and agent workflows, M2.5 currently leads the open-source field.

Can I run M2.5 locally?
Not yet: M2.5 weights weren't released at launch. Previous versions run locally via vLLM, SGLang, or MLX; based on MiniMax's open-source history, expect M2.5 weights to follow on HuggingFace.
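
For the earlier open-weight checkpoints, local serving looks roughly like the sketch below using vLLM's offline API. The HuggingFace repo id and GPU count are assumptions for illustration; check the model card for actual requirements, and note that a 230B-total MoE still needs a multi-GPU node even with only ~10B parameters active per token.

```python
# Minimal vLLM offline-inference sketch for a previous open-weight M2 checkpoint.
# The repo id and GPU count are illustrative assumptions, not confirmed values.
from vllm import LLM, SamplingParams

llm = LLM(
    model="MiniMaxAI/MiniMax-M2",   # assumed repo id; swap in M2.5 if/when weights land
    tensor_parallel_size=8,          # shard across 8 GPUs; adjust to your hardware
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Write a Go function that reverses a singly linked list."], params)
print(outputs[0].outputs[0].text)
```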

Which coding agents work with M2.5?
M2.5 is tested on Claude Code, Droid (Factory AI), Cline, Kilo Code, OpenCode, Roo Code, and BlackBox. It exposes Anthropic-compatible and OpenAI-compatible API endpoints for drop-in replacement.
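
Because the endpoints are OpenAI-compatible, switching an existing tool over is mostly a configuration change. Here's a minimal sketch with the openai Python SDK; the base URL and model id are placeholders, so check MiniMax's platform docs or OpenRouter for the real values.

```python
# Drop-in use via an OpenAI-compatible endpoint.
# The base URL and model id are placeholders; export MINIMAX_API_KEY yourself.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://<minimax-openai-compatible-host>/v1",  # placeholder
    api_key=os.environ["MINIMAX_API_KEY"],
)

response = client.chat.completions.create(
    model="MiniMax-M2.5",  # placeholder model id
    messages=[{"role": "user", "content": "Refactor this handler to be idempotent."}],
)
print(response.choices[0].message.content)
```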
