Claude Code is Anthropic's agentic coding tool. It's also expensive: Opus 4.5 burns through API credits fast, and Claude Max at $90/month adds up. A team of engineers can easily hit $2,000+/month.
Since January 2026, you don't need Anthropic's API anymore. Ollama v0.14.0 added native Anthropic Messages API compatibility. Three environment variables, and Claude Code talks to local models instead. Zero API costs. Your code never leaves your machine.
Here's the complete setup, the models that actually work, and the honest performance reality.
How It Works
Claude Code doesn't care where its model lives. It speaks the Anthropic Messages API. Ollama now speaks that same protocol. Point Claude Code at localhost:11434 instead of api.anthropic.com, and it works the same way — file edits, tool calls, terminal commands, the full agentic loop.
The key difference: instead of sending your entire codebase to Anthropic's servers for inference, everything runs on your hardware. Privacy is absolute. Latency depends on your machine, not your internet connection.
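You can see the protocol swap directly. Here is a minimal sketch of an Anthropic-style request sent to the local server instead of api.anthropic.com; it assumes Ollama mirrors Anthropic's /v1/messages path (which is what lets Claude Code work unchanged) and that Ollama and a model are already installed per the setup below. The auth header mirrors what a client would send and may not be enforced locally.
# Anthropic Messages API request, pointed at the local Ollama server
curl http://localhost:11434/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: ollama" \
  -d '{
    "model": "glm-4.7-flash",
    "max_tokens": 128,
    "messages": [{"role": "user", "content": "Reply with one short sentence."}]
  }'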
What You Need
Software:
- Ollama v0.14.0+ (v0.14.3-rc1 or later recommended for streaming tool calls)
- Claude Code CLI (latest version)
- Node.js 18+
Hardware (realistic minimums):
| Setup | RAM | Models You Can Run | Tokens/Sec | Cost |
|---|---|---|---|---|
| Mac Mini M4 24GB | 24GB | GLM-4.7-Flash (Q4), small models | 20-30 | $599 |
| Mac Mini M4 Pro 48GB | 48GB | Most 30B models comfortably | 35-55 | $1,599 |
| Mac Mini M4 Pro 64GB | 64GB | 32B models, some 70B quantized | 10-60 | $1,999 |
| RTX 4090 24GB | 24GB VRAM | GLM-4.7-Flash, fast | 120-220 | ~$1,800 GPU |
The Mac Mini M4 Pro 64GB at $1,999 is the sweet spot. Unified memory means no VRAM bottleneck. Runs 30B MoE models at usable speeds. Pays for itself in about 8 months vs API costs if you're a heavy user. I wrote a full breakdown of running a Mac Mini as an AI server if you need the hardware deep dive.
Setup: 5 Minutes, 4 Steps
Step 1: Install Ollama
# macOS/Linux
curl -fsSL https://ollama.ai/install.sh | sh
# For full tool-call support, use pre-release:
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.14.3-rc1 sh
On macOS, you can also download from ollama.com/download.
Verify it's running:
ollama --version
# Should show 0.14.0 or higher
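Claude Code talks to the Ollama HTTP server, not the CLI, so it's worth confirming the server is answering too:
curl http://localhost:11434/api/version
# Expect JSON like {"version":"0.14.3"}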
Step 2: Pull a Model
# Recommended starter — best tool-calling support
ollama pull glm-4.7-flash
# Alternative coding models
ollama pull qwen3-coder
ollama pull gpt-oss:20b
GLM-4.7-Flash is a 30B parameter MoE model with only 3B active parameters per token. That's why it's fast despite the large total size. 128K context window. Native tool-calling support — critical for Claude Code's agentic loop.
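Before wiring it into Claude Code, you can confirm the model advertises tool support and check its context window with ollama show (the exact output layout varies by Ollama version):
ollama show glm-4.7-flash
# Look for "tools" under capabilities and the context length in the model details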
Step 3: Configure Environment
Quick way (new):
ollama launch claude
This handles everything automatically.
Manual way (more control):
Add to your ~/.bashrc, ~/.zshrc, or ~/.config/fish/config.fish:
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_API_KEY=""
export ANTHROPIC_BASE_URL=http://localhost:11434
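Reload your shell and confirm the variables are set before launching:
source ~/.zshrc   # or whichever config file you edited
echo $ANTHROPIC_BASE_URL
# Should print http://localhost:11434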
Or set them in Claude Code's settings file at ~/.claude/settings.json:
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:11434",
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1"
  }
}
The CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC flag is optional but recommended — it prevents Claude Code from phoning home and ensures everything stays local.
Step 4: Launch
claude --model glm-4.7-flash
Or inline without persisting environment variables:
ANTHROPIC_AUTH_TOKEN=ollama \
ANTHROPIC_BASE_URL=http://localhost:11434 \
ANTHROPIC_API_KEY="" \
claude --model glm-4.7-flash
That's it. Claude Code is now running on your local model.
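For a quick end-to-end check of the agentic loop, you can run a one-shot prompt in headless mode; this assumes your Claude Code version supports the -p (print) flag:
# One-shot prompt that exercises the local model and at least one tool call
claude --model glm-4.7-flash -p "List the files in this directory and summarize what this project does"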
Best Models for Local Agentic Coding
Not all models work well with Claude Code. The agentic loop requires tool-calling support, decent context windows, and coding ability. Here's what actually performs:
| Model | Params (Active) | Context | Tool Calling | Best For |
|---|---|---|---|---|
| GLM-4.7-Flash | 30B (3B) | 128K | Native | Best overall starter |
| Qwen3-Coder-30B | 30B (3B) | 256K | Yes | Coding specialist |
| GPT-OSS-20B | 20B (3.6B) | 128K | Yes | General tasks |
| Devstral-2-Small | 24B | 128K | Yes | Lightweight option |
My recommendation: Start with GLM-4.7-Flash. It has the best balance of speed, tool-calling reliability, and coding quality for Claude Code workflows. Ollama's own documentation recommends it for Claude Code integration.
Qwen3-Coder-Next is the better pure coding model, but GLM-4.7-Flash has more reliable tool-calling — and tool-calling is what makes Claude Code agentic rather than just a chatbot.
The Reality Check
I'll be direct about what you're giving up.
What works well locally:
- Routine refactoring and file edits
- Test generation
- Code review and analysis
- Simple feature implementations
- Documentation writing
- Sensitive/proprietary code work
What still needs cloud models:
- Complex multi-file architectural changes
- Deep reasoning across large codebases
- Novel algorithm design
- Tasks requiring frontier-level intelligence
GLM-4.7-Flash scores 59.2% on SWE-bench Verified. That's impressive for a local model — it beats Qwen3-30B (22%) and GPT-OSS-20B (34%). But Opus 4.5 is still in a different league for complex reasoning.
The practical approach: use local for 80% of your daily coding tasks. Switch to cloud API when you hit something that genuinely needs frontier intelligence.
Context Length: The Hidden Gotcha
Claude Code eats context. Every file it reads, every command it runs, every tool call — all tokens. Ollama defaults to relatively short context windows.
Set a minimum of 20K for basic use, 32K+ for real projects:
# Set context length when starting Ollama
OLLAMA_CONTEXT_LENGTH=32768 ollama serve
Or in your Modelfile:
FROM glm-4.7-flash
PARAMETER num_ctx 32768
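If you go the Modelfile route, build a named variant from it and point Claude Code there; the model name below is just an example:
ollama create glm-4.7-flash-32k -f Modelfile
claude --model glm-4.7-flash-32k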
Higher context = more RAM. On a 64GB Mac Mini, 32K context is comfortable. 64K is possible but cuts into model performance. 128K needs 48GB+ just for the KV cache on top of model weights.
DataCamp's testing found 20K context provides the best balance between functionality and speed for Claude Code workflows. Start there and increase only if you're hitting limits.
Common Issues and Fixes
"Connection refused" Ollama isn't running. Start it:
ollama serve
"Model not found" Check installed models and use the exact name:
ollama list
Tool calls failing / streaming errors: You need Ollama 0.14.3-rc1 or later. Stable releases before this had issues with streaming tool calls that break Claude Code's agentic loop:
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.14.3-rc1 sh
Slow responses: Expected on CPU. On Apple Silicon, make sure the model fits in unified memory; any page-out to SSD kills performance. Check with:
ollama ps
# Look for "100% GPU" in the PROCESSOR column
Verify it's truly local: Disconnect from the internet and run a prompt. If you still get a response, inference is happening entirely on your machine.
Switching Between Local and Cloud
You don't have to choose one. Use local for daily work, cloud for complex tasks.
Switch to local:
export ANTHROPIC_BASE_URL=http://localhost:11434
export ANTHROPIC_AUTH_TOKEN=ollama
claude --model glm-4.7-flash
Switch back to cloud:
unset ANTHROPIC_BASE_URL
unset ANTHROPIC_AUTH_TOKEN
unset ANTHROPIC_API_KEY  # only needed if you exported it empty for the local setup
claude # Uses Anthropic API with your API key
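If you switch often, a pair of shell functions keeps it to one command each way. A minimal sketch for ~/.zshrc or ~/.bashrc; the function names and default model are just suggestions:
# Local: override the endpoint only for this invocation
claude-local() {
  ANTHROPIC_BASE_URL=http://localhost:11434 \
  ANTHROPIC_AUTH_TOKEN=ollama \
  ANTHROPIC_API_KEY="" \
  claude --model glm-4.7-flash "$@"
}

# Cloud: strip the overrides so Claude Code falls back to the Anthropic API
claude-cloud() {
  env -u ANTHROPIC_BASE_URL -u ANTHROPIC_AUTH_TOKEN -u ANTHROPIC_API_KEY claude "$@"
}
With these in place, claude-local covers daily work and claude-cloud is there when you need frontier reasoning.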
Cost Comparison
| | Cloud API (Opus 4.5) | Claude Max ($90/mo) | Local (Mac Mini) |
|---|---|---|---|
| Hardware | $0 | $0 | $1,999 one-time |
| Monthly Cost | $200-2,000+ | $90 | ~$3-5 electricity |
| Break-Even | — | — | 4-8 months |
| Privacy | Code sent to Anthropic | Code sent to Anthropic | 100% local |
| Speed | 50+ tok/s | 50+ tok/s | 20-60 tok/s |
| Intelligence | Frontier | Frontier | Good enough for 80% |
I was paying $90/month for Claude Max and another $30-40 for Gemini API. Call it $120-130/month. The Mac Mini pays for itself in under a year, and after that it's basically free AI forever.
The Verdict
Mac Mini M4 Pro 64GB + Ollama + GLM-4.7-Flash. That's the setup.
It won't replace Opus 4.5 for complex architectural decisions. It will handle your daily refactoring, test writing, code review, and documentation — at zero marginal cost, with zero data leaving your machine.
If you also want a personal AI agent on your phone, OpenClaw on the same Mac Mini turns one box into both your coding assistant and your Telegram bot.
Start with ollama launch claude. Upgrade your model or hardware when you hit real limits, not imaginary ones.