The Hardware Revolution
The dream of running your own 24/7 AI assistant on dedicated hardware is now reality. OpenClaw (formerly Clawdbot, formerly Moltbot) has emerged as the go-to solution for self-hosted AI agents, and the Mac Mini M4 has become the hardware of choice for developers who want local inference without cloud dependency.
This guide covers everything: hardware decisions, model selection, Ollama configuration, and how to connect Claude Code to local models. Whether you're after privacy, cost savings, or just the satisfaction of running trillion-parameter models in your attic, this is your roadmap.
Why Mac Mini for Local AI?
Apple Silicon changed the game for local LLM inference. The unified memory architecture means no data shuffling between CPU RAM and GPU VRAM—everything shares one pool. For AI workloads that are memory-bandwidth bound, this eliminates the biggest bottleneck.
Key advantages:
- Unified Memory: No copying between system RAM and VRAM. A 64GB Mac Mini can allocate most of that to model inference.
- Power Efficiency: A Mac Mini draws 20-40W under load. Compare that to an RTX 4090 system pulling 500W+.
- Silent Operation: No GPU fans screaming at you during inference.
- Always-On Ready: Low power draw makes 24/7 operation practical.
The Mac Mini M4 Pro with 64GB unified memory has become the sweet spot for serious local AI work. Jeff Geerling's testing shows this configuration comfortably running 32B parameter models at 11-12 tokens per second—fast enough for real-time coding assistance.
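If you want to verify throughput numbers like these on your own hardware, Ollama's verbose mode prints timing statistics after each response; the model and prompt below are only examples.
# The "eval rate" line in the output is the tokens-per-second figure quoted above
ollama pull llama3.1:8b
ollama run --verbose llama3.1:8b "Write a Python function that reverses a linked list."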
Hardware Recommendations by Use Case
Budget Setup: Mac Mini M4 (24GB) — ~$800
What you can run: 7-8B parameter models (Llama 3.1 8B, DeepSeek Coder 6.7B, Qwen2.5-Coder 7B)
Performance: ~15-20 tokens/second
Reality check: Good for experimentation, but you'll hit memory pressure quickly. The 24GB configuration works, but you're limited to smaller models with aggressive quantization.
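Ollama's default tags for most models are already 4-bit quantized (typically Q4_K_M), and you can confirm what a pulled model uses before trusting it on a 24GB machine; the model name here is only an example.
# Prints architecture, parameter count, context length, and quantization level
ollama show llama3.1:8b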
Recommended Setup: Mac Mini M4 Pro (64GB) — ~$2,000
What you can run: 30-32B parameter models, MoE models like Qwen3-Coder-30B-A3B
Performance: ~10-15 tokens/second on 32B models
Why it's the sweet spot: 64GB lets you run Qwen2.5-Coder-32B, the most capable coding model that fits on consumer hardware. You can have multiple models loaded simultaneously and still have headroom for the OS.
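Keeping multiple models resident depends on how the Ollama server is configured, not just on available RAM. Two environment variables control this; the values below are illustrative, not requirements.
# Allow two models in memory at once and keep them loaded for an hour between requests
export OLLAMA_MAX_LOADED_MODELS=2
export OLLAMA_KEEP_ALIVE=1h
Set these in the environment the Ollama server starts with (for the macOS app, launchctl setenv works) and restart it.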
Enthusiast Setup: Mac Studio M3 Ultra (256GB-512GB) — $7,000-$10,000
What you can run: 70B+ models, DeepSeek-R1 671B (quantized), Kimi K2 (with heavy quantization)
Performance: ~5-10 tokens/second on massive models
The truth about Kimi K2: The 1 trillion parameter Kimi K2 model requires 250GB+ just for the weights. Even with a 512GB Mac Studio, you're running heavily quantized versions (1.8-bit) at 1-2 tokens per second. It works, but it's not practical for daily use.
The "Porsche Money" Setup: 4x Mac Studio M3 Ultra Cluster — $40,000+
Jeff Geerling demonstrated running Kimi K2 Thinking at 28-30 tokens/second across four Mac Studios connected via Thunderbolt 5 using RDMA and the Exo framework. This is bleeding edge—macOS 26.2 introduced RDMA over Thunderbolt 5 specifically for this use case.
If you have this budget, you're in genuine frontier model territory. But for 99% of developers, the Mac Mini M4 Pro 64GB is the right answer.
Best Local Models for Coding (2026)
Model selection matters more than hardware. Here's what actually works for agentic coding tasks:
Tier 1: Best for OpenClaw and Claude Code
GLM-4.7-Flash (9B active, 128K context)
The current recommendation from Ollama for Claude Code integration. Excellent tool-calling support, a 128K context window, and it runs well on 24GB+ systems. This is the model to start with.
ollama pull glm-4.7-flash
Qwen3-Coder-30B-A3B (30B total, 3B active per token)
A Mixture-of-Experts model optimized for coding. The MoE architecture means only 3B parameters are active at inference time, so it's fast despite the large total size. Supports 256K context and native tool calling. Requires 64GB RAM.
ollama pull qwen3-coder:30b
GPT-OSS-20B
OpenAI's first open-weights release since GPT-2. Broad ecosystem support (Ollama, vLLM, LM Studio) and good general-purpose coding. A pragmatic choice that "just works."
ollama pull gpt-oss:20b
Tier 2: Excellent but Resource Hungry
DeepSeek-Coder-V2 (16B)
Strong multilingual coding support (300+ programming languages) and excellent for repository-level tasks. Good for developers working with less common languages.
Codestral-22B
Mistral's purpose-built coding model. 32K context, strong on structured outputs. A solid single-GPU choice if you prefer Mistral's style.
Tier 3: Frontier (If You Have the Hardware)
Kimi K2 / K2.5 (1T parameters, 32B active)
State-of-the-art agentic coding capabilities rivaling Claude Sonnet 4. But the full model needs 250GB+ storage and 247GB+ RAM for reasonable speeds. The Unsloth 1.8-bit quantized version (245GB) can run on a single 512GB Mac Studio at 1-2 tokens/second.
For most developers, Kimi K2 is better accessed via API than run locally.
Qwen3-Coder-480B-A35B
Alibaba's flagship coding model. Matches Claude Sonnet 4 on benchmarks. Requires multi-GPU clusters or Mac Studio clusters—not a consumer option.
OpenClaw Setup: Step by Step
OpenClaw is a gateway that connects AI models to messaging platforms (WhatsApp, Telegram, Slack, Discord, iMessage). You message it like a coworker, and it can browse the web, run commands, manage files—anything a person could do at a keyboard.
Prerequisites
- Node.js 22+
- Ollama installed and running
- A messaging platform account (Telegram is easiest to start)
Installation
# Install OpenClaw globally
npm install -g openclaw@latest
# Run the onboarding wizard
openclaw onboard --install-daemon
The wizard walks you through:
- Gateway configuration (local vs. remote)
- Model provider selection
- Channel setup (WhatsApp, Telegram, etc.)
- Skills configuration
Configuring Ollama as the Model Provider
During onboarding, select "OpenAI-compatible" as your provider, then configure:
{
  "agent": {
    "model": "ollama/glm-4.7-flash",
    "baseUrl": "http://localhost:11434/v1"
  }
}
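Before wiring this into OpenClaw, it's worth a quick smoke test that Ollama's OpenAI-compatible endpoint answers at that baseUrl (this assumes glm-4.7-flash is already pulled):
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-4.7-flash",
    "messages": [{"role": "user", "content": "Reply with one word."}]
  }'
If that returns a chat completion, OpenClaw will be able to reach the same endpoint.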
Connecting to Telegram
- Open Telegram and search for @BotFather
- Send /newbot and follow the prompts
- Copy the bot token BotFather provides
- Add it to your OpenClaw config:
{
  "channels": {
    "telegram": {
      "botToken": "YOUR_BOT_TOKEN"
    }
  }
}
Restart the gateway: openclaw gateway restart
Important: Context Length
OpenClaw requires a context window of at least 64K tokens. When using Ollama, verify your model supports this:
ollama show glm-4.7-flash --modelfile
If needed, create a custom Modelfile to increase context:
FROM glm-4.7-flash
PARAMETER num_ctx 65536
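The Modelfile does nothing by itself; build a new model from it and point OpenClaw at that name (glm-4.7-flash-64k below is just an example name):
# Build the larger-context variant and confirm the parameter took effect
ollama create glm-4.7-flash-64k -f Modelfile
ollama show glm-4.7-flash-64k --modelfile
Then update the model field in your OpenClaw config to ollama/glm-4.7-flash-64k.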
Claude Code + Ollama: Local Agentic Coding
Claude Code is Anthropic's terminal-based coding agent. Since Ollama v0.14.0 (January 2026), you can run Claude Code against local models via Ollama's Anthropic-compatible API.
Setup
Install Claude Code:
curl -fsSL https://claude.ai/install.sh | bash
Configure environment variables:
export ANTHROPIC_AUTH_TOKEN="ollama"
export ANTHROPIC_API_KEY=""
export ANTHROPIC_BASE_URL="http://localhost:11434"
Or add to ~/.zshrc / ~/.bashrc for persistence.
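Before launching Claude Code, a quick check that Ollama is actually serving on that port saves confusing connection errors:
# Should return a JSON list of locally installed models; if not, start the server with: ollama serve
curl http://localhost:11434/api/tags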
Running Claude Code with Local Models
One-liner:
ollama launch claude
Or run directly:
claude --model qwen3-coder:30b
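Claude Code also supports non-interactive, one-shot prompts via its print flag (-p in current releases; flag names may vary by version), which is handy for scripting against a local model:
# Run a single prompt against the local model, print the result, and exit
claude -p "Summarize what this repository does" --model qwen3-coder:30b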
The Hybrid Approach: When to Use What
The smartest setup isn't purely local or purely cloud—it's using the right tool for each task:
Use local models for:
- Prototyping and iteration
- Sensitive code that can't leave your machine
- Learning and experimentation
- Offline development
Use cloud APIs for:
- Production-critical code review
- Complex architectural decisions
- Tasks requiring state-of-the-art reasoning
- When speed matters
OpenClaw and Claude Code both support model routing. You can configure fallbacks:
{
  "agent": {
    "model": "ollama/glm-4.7-flash",
    "fallback": "anthropic/claude-sonnet-4"
  }
}
The Verdict
The local AI stack in 2026 is genuinely capable. A Mac Mini M4 Pro running OpenClaw with Qwen3-Coder-30B gives you a 24/7 AI assistant that:
- Never sends your code to the cloud
- Costs nothing after hardware purchase
- Works offline
- Integrates with your existing messaging apps
Is it as good as Claude Opus 4.5 via API? No. Is it good enough for most development tasks? Absolutely.
The temptation to buy stacked Mac Studios is real—but for most developers, a single Mac Mini M4 Pro with 64GB is the pragmatic choice. Start there, and upgrade when you hit actual limits, not imagined ones.