Marco Patzelt
February 9, 2026

Claude Code for $3/Month: Local Ollama Setup Guide

Run Claude Code on local models for $3/month instead of $90+ API bills. Full Ollama setup guide, best models for agentic coding, real Mac Mini benchmarks.

Claude Code is Anthropic's best agentic coding tool. It's also expensive — Opus 4.5 burns through API credits fast, and Claude Max at $90/month adds up. A team of engineers can easily hit $2,000+/month.

Since January 2026, you don't need Anthropic's API anymore. Ollama v0.14.0 added native Anthropic Messages API compatibility. Three environment variables, and Claude Code talks to local models instead. Zero API costs. Your code never leaves your machine.

Here's the complete setup, the models that actually work, and the honest performance reality.

How It Works

Claude Code doesn't care where its model lives. It speaks the Anthropic Messages API. Ollama now speaks that same protocol. Point Claude Code at localhost:11434 instead of api.anthropic.com, and it works the same way — file edits, tool calls, terminal commands, the full agentic loop.
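
You can see the handshake for yourself before involving Claude Code at all. Assuming Ollama mirrors the standard /v1/messages path and accepts any placeholder token (and that you've already pulled glm-4.7-flash, see Step 2), a raw Messages request looks like this:

curl http://localhost:11434/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: ollama" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "glm-4.7-flash",
    "max_tokens": 128,
    "messages": [{"role": "user", "content": "Reply with one short sentence."}]
  }'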

The key difference: instead of sending your entire codebase to Anthropic's servers for inference, everything runs on your hardware. Privacy is absolute. Latency depends on your machine, not your internet connection.

What You Need

Software:

  • Ollama v0.14.0+ (v0.14.3-rc1 or later recommended for streaming tool calls)
  • Claude Code CLI (latest version)
  • Node.js 18+

Hardware (realistic minimums):

Setup | RAM | Models You Can Run | Tokens/Sec | Cost
Mac Mini M4 24GB | 24GB | GLM-4.7-Flash (Q4), small models | 20-30 | $599
Mac Mini M4 Pro 48GB | 48GB | Most 30B models comfortably | 35-55 | $1,599
Mac Mini M4 Pro 64GB | 64GB | 32B models, some 70B quantized | 10-60 | $1,999
RTX 4090 24GB | 24GB VRAM | GLM-4.7-Flash, fast | 120-220 | ~$1,800 (GPU only)

The Mac Mini M4 Pro 64GB at $1,999 is the sweet spot. Unified memory means no VRAM bottleneck. Runs 30B MoE models at usable speeds. Pays for itself in about 8 months vs API costs if you're a heavy user. I wrote a full breakdown of running a Mac Mini as an AI server if you need the hardware deep dive.
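
(The break-even math, assuming roughly $250/month in heavy API spend: $1,999 ÷ $250 ≈ 8 months. Lighter users take proportionally longer.)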

Setup: 5 Minutes, 4 Steps

Step 1: Install Ollama

# macOS/Linux
curl -fsSL https://ollama.ai/install.sh | sh

# For full tool-call support, use pre-release:
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.14.3-rc1 sh

On macOS, you can also download from ollama.com/download.

Verify it's running:

ollama --version
# Should show 0.14.0 or higher
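
You can also confirm the server itself is up and answering on its default port:

curl http://localhost:11434/api/version
# Returns something like {"version":"0.14.3-rc1"}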

Step 2: Pull a Model

# Recommended starter — best tool-calling support
ollama pull glm-4.7-flash

# Alternative coding models
ollama pull qwen3-coder
ollama pull gpt-oss:20b

GLM-4.7-Flash is a 30B parameter MoE model with only 3B active parameters per token. That's why it's fast despite the large total size. 128K context window. Native tool-calling support — critical for Claude Code's agentic loop.
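
To sanity-check what you pulled, ollama show prints the model card, including parameter count, context length, and (in recent Ollama builds) whether tool calling is advertised:

ollama show glm-4.7-flash
# Check the reported context length and look for "tools" under capabilities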

Step 3: Configure Environment

Quick way (new):

ollama launch claude

This handles everything automatically.

Manual way (more control):

Add to your ~/.bashrc, ~/.zshrc, or ~/.config/fish/config.fish:

export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_API_KEY=""
export ANTHROPIC_BASE_URL=http://localhost:11434

Or set them in Claude Code's settings file at ~/.claude/settings.json:

{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:11434",
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1"
  }
}

The CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC flag is optional but recommended — it prevents Claude Code from phoning home and ensures everything stays local.

Step 4: Launch

claude --model glm-4.7-flash

Or inline without persisting environment variables:

ANTHROPIC_AUTH_TOKEN=ollama \
ANTHROPIC_BASE_URL=http://localhost:11434 \
ANTHROPIC_API_KEY="" \
claude --model glm-4.7-flash

That's it. Claude Code is now running on your local model.
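
A quick smoke test before trusting it with real work (the -p flag runs a single non-interactive prompt in current Claude Code builds):

# One-shot prompt, no interactive session
claude --model glm-4.7-flash -p "Read package.json and summarize the scripts it defines."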

Best Models for Local Agentic Coding

Not all models work well with Claude Code. The agentic loop requires tool-calling support, decent context windows, and coding ability. Here's what actually performs:

Model | Params (Active) | Context | Tool Calling | Best For
GLM-4.7-Flash | 30B (3B) | 128K | Native | Best overall starter
Qwen3-Coder-30B | 30B (3B) | 256K | Yes | Coding specialist
GPT-OSS-20B | 20B (dense) | 128K | Yes | General tasks
Devstral-2-Small | 24B | 128K | Yes | Lightweight option

My recommendation: Start with GLM-4.7-Flash. It has the best balance of speed, tool-calling reliability, and coding quality for Claude Code workflows. Ollama's own documentation recommends it for Claude Code integration.

Qwen3-Coder-Next is the better pure coding model, but GLM-4.7-Flash has more reliable tool-calling — and tool-calling is what makes Claude Code agentic rather than just a chatbot.

The Reality Check

I'll be direct about what you're giving up.

What works well locally:

  • Routine refactoring and file edits
  • Test generation
  • Code review and analysis
  • Simple feature implementations
  • Documentation writing
  • Sensitive/proprietary code work

What still needs cloud models:

  • Complex multi-file architectural changes
  • Deep reasoning across large codebases
  • Novel algorithm design
  • Tasks requiring frontier-level intelligence

GLM-4.7-Flash scores 59.2% on SWE-bench Verified. That's impressive for a local model — it beats Qwen3-30B (22%) and GPT-OSS-20B (34%). But Opus 4.5 is still in a different league for complex reasoning.

The practical approach: use local for 80% of your daily coding tasks. Switch to cloud API when you hit something that genuinely needs frontier intelligence.

Context Length: The Hidden Gotcha

Claude Code eats context. Every file it reads, every command it runs, every tool call — all tokens. Ollama defaults to relatively short context windows.

Set a minimum of 20K for basic use, 32K+ for real projects:

# Set the default context length when starting the Ollama server
OLLAMA_CONTEXT_LENGTH=32768 ollama serve

Or in your Modelfile:

FROM glm-4.7-flash
PARAMETER num_ctx 32768
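
If you go the Modelfile route, build a named variant and point Claude Code at it (the variant name here is just an example):

# Bake the larger context into a named model variant
ollama create glm-4.7-flash-32k -f Modelfile
claude --model glm-4.7-flash-32k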

Higher context = more RAM. On a 64GB Mac Mini, 32K context is comfortable. 64K is possible but cuts into model performance. 128K needs 48GB+ just for the KV cache on top of model weights.
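
A rough rule of thumb for the cache itself: KV bytes ≈ 2 × layers × KV heads × head dimension × context length × bytes per value. Doubling the context roughly doubles the cache on top of the fixed model weights; the exact figure depends on the model's attention layout and quantization.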

DataCamp's testing found 20K context provides the best balance between functionality and speed for Claude Code workflows. Start there and increase only if you're hitting limits.

Common Issues and Fixes

"Connection refused" Ollama isn't running. Start it:

ollama serve

"Model not found" Check installed models and use the exact name:

ollama list

Tool calls failing / streaming errors: You need Ollama 0.14.3-rc1 or later. Stable releases before this had issues with streaming tool calls that break Claude Code's agentic loop:

curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.14.3-rc1 sh

Slow responses: Expected on CPU. On Apple Silicon, make sure the model fits in unified memory — any page-out to SSD kills performance. Check with:

ollama ps
# Look for "100% GPU" in the PROCESSOR column

Verify it's truly local: Disconnect from the internet and run a prompt. If you get a response, you're fully offline.
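
On macOS you can also spot-check for open sockets while a session runs. Both commands below make assumptions: that the server process is literally named ollama, and that Wi-Fi lives on en0 (adjust for your machine):

# List internet sockets held by the Ollama server; expect only the local listener on 11434
lsof -i -a -p "$(pgrep -x ollama)"

# Or cut the network entirely and keep prompting
networksetup -setairportpower en0 off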

Switching Between Local and Cloud

You don't have to choose one. Use local for daily work, cloud for complex tasks.

Switch to local:

export ANTHROPIC_BASE_URL=http://localhost:11434
export ANTHROPIC_AUTH_TOKEN=ollama
claude --model glm-4.7-flash

Switch back to cloud:

unset ANTHROPIC_BASE_URL
unset ANTHROPIC_AUTH_TOKEN
unset ANTHROPIC_API_KEY  # clear the empty placeholder so your real key or login works again
claude  # Uses the Anthropic API with your API key
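
If you flip between the two a lot, a pair of small shell functions saves the retyping. These names are just a suggestion; drop them in the same rc file as the exports above:

# Local: pin the endpoint and model for this invocation only
claude_local() {
  ANTHROPIC_BASE_URL=http://localhost:11434 \
  ANTHROPIC_AUTH_TOKEN=ollama \
  ANTHROPIC_API_KEY="" \
  claude --model glm-4.7-flash "$@"
}

# Cloud: strip the overrides so claude falls back to your Anthropic account
claude_cloud() {
  env -u ANTHROPIC_BASE_URL -u ANTHROPIC_AUTH_TOKEN -u ANTHROPIC_API_KEY claude "$@"
}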

Cost Comparison

Factor | Cloud API (Opus 4.5) | Claude Max ($90/mo) | Local (Mac Mini)
Hardware | $0 | $0 | $1,999 one-time
Monthly Cost | $200-2,000+ | $90 | ~$3-5 electricity
Break-Even | n/a | n/a | 4-8 months
Privacy | Code sent to Anthropic | Code sent to Anthropic | 100% local
Speed | 50+ tok/s | 50+ tok/s | 20-60 tok/s
Intelligence | Frontier | Frontier | Good enough for 80%

I was paying $90/month for Claude Max and another $30-40 for Gemini API. That's about $130/month. At that rate the Mac Mini pays for itself in roughly 15 months ($1,999 ÷ ~$130), and after that it's basically free AI forever.

The Verdict

Mac Mini M4 Pro 64GB + Ollama + GLM-4.7-Flash. That's the setup.

It won't replace Opus 4.5 for complex architectural decisions. It will handle your daily refactoring, test writing, code review, and documentation — at zero marginal cost, with zero data leaving your machine.

If you also want a personal AI agent on your phone, OpenClaw on the same Mac Mini turns one box into both your coding assistant and your Telegram bot.

Start with ollama launch claude. Upgrade your model or hardware when you hit real limits, not imaginary ones.

Frequently Asked Questions

Can Claude Code really run on local models?
Yes. Since Ollama v0.14.0, Claude Code connects to local models via the Anthropic-compatible API. Three environment variables, one command, zero API costs, full privacy.

Which local model works best with Claude Code?
GLM-4.7-Flash. 30B MoE model with 3B active params, native tool-calling, 128K context. Ollama officially recommends it for Claude Code integration.

How much RAM do I need?
Minimum 24GB for basic models. 64GB unified memory (Mac Mini M4 Pro) is the sweet spot — runs 30B models at 20-60 tokens/second comfortably.

Is a local model as good as the cloud models?
No. Local handles 80% of daily tasks well — refactoring, tests, code review. Complex multi-file architecture and deep reasoning still benefit from Opus 4.5 via cloud.

How do I set it up?
Run 'ollama launch claude' for automatic setup. Or set ANTHROPIC_AUTH_TOKEN=ollama, ANTHROPIC_BASE_URL=http://localhost:11434, then run 'claude --model glm-4.7-flash'.

Does it work fully offline?
Yes. Once downloaded, the model runs fully offline. Set CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 to prevent any network calls. Disconnect from the internet to verify.

How fast is it?
GLM-4.7-Flash on a Mac Mini M4 Pro 64GB: 35-60 tokens/second. Mac Mini M4 24GB: 20-30 tokens/second. RTX 4090: 120-220 tokens/second. Cloud is faster but costs money.

What does it cost to run?
About $3-5/month in electricity on a Mac Mini. Hardware is $1,999 one-time. Breaks even against a $90/month Claude Max subscription in about two years, faster if you're replacing heavier API spend.
