Marco Patzelt
March 10, 2026

Claude Code Ran Overnight — 50+ AI Agent Repos From Scratch

I gave Claude Code a loop, a venture score, and one rule: never stop. It researched, built, and shipped 50+ working agent repos from one markdown file.

Karpathy released autoresearch and gave away something most people haven't understood yet. Not the code — the pattern. Give an AI a loop, a metric, and a single instruction: "NEVER STOP." He pointed it at neural net training. I pointed it at a different problem.

What if the agent doesn't optimize a loss function — it optimizes a Venture Score? Instead of tweaking train.py, it forks a seed harness. Instead of a GPU, it has an API key ring.

I went to bed. When I woke up, there were 50+ repos — built, committed, and ready to run.

The Pattern Is Stupidly Simple

I wrote about Karpathy's autoresearch pattern before — the insight is that constraints enable autonomy, not limit it. The autoresearch loop works because every component is bounded:

  1. Research — search Reddit, HN, GitHub for pain points
  2. Score — SIGNAL (3+ people want it?) + GAP (no free tool?) + FEASIBLE (2-3 tools?)
  3. Build — fork the seed harness, write specialized tools, customize the system prompt
  4. Validate — does npm install && npm run dev work? Does a test prompt return something useful?
  5. Commit — git commit everything, log to results.tsv
  6. GOTO 1

That's it. The entire architecture fits in a single markdown file called program.md. It's the only file a human touches. The loop ran for multiple sessions. It didn't stop at 5 agents. It kept researching, kept scoring, kept building. 50+ repos later, the pattern held.
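The six steps reduce to a small control loop. Here's a minimal sketch in TypeScript, where the callbacks stand in for work the real system delegates to Claude Code; none of these names come from program.md:

```typescript
// Sketch of the autoresearch-style loop. `research` and `build` are
// placeholders for the real steps; only the loop's shape is from the article.
type Idea = { name: string; researchPoints: number }; // 0-3, see venture score

function loop(
  research: () => Idea,        // steps 1-2: find and score a pain point
  build: (i: Idea) => boolean, // steps 3-5: fork harness, validate, commit
  shouldStop: () => boolean,   // e.g. context window exhausted
): string[] {
  const shipped: string[] = [];
  while (!shouldStop()) {      // "NEVER STOP" until the session dies
    const idea = research();
    if (idea.researchPoints < 3) continue; // build only on 3/3 research
    if (build(idea)) shipped.push(idea.name);
  }                            // step 6: GOTO 1
  return shipped;
}
```

The loop itself is dumb on purpose; all the intelligence lives in the research and build steps.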

Three Layers, No Framework

The system has three components. No LangChain, no CrewAI, no abstraction layers.

Layer 1: Claude Code (the meta-agent)

Claude Code is the researcher and builder. It searches Reddit, Hacker News, GitHub, and Twitter for real pain points. It reads threads, counts how many people are complaining, checks if good free solutions already exist, and scores each problem on a 6-point checklist.

Layer 2: The Seed Harness (the template)

A minimal agentic AI chat app — hand-written orchestration loop, model-agnostic provider layer, streaming UI. Every agent the system builds is a fork of this harness with 2-3 specialized tools swapped in. This is what I mean by bare metal over frameworks — the template is small enough to understand in 10 minutes.

Layer 3: Composio (the API layer)

Instead of hardcoding one integration per service (GitHub, Reddit, Google), the harness has 3 meta-tools that let built agents dynamically discover and use any of 250k+ API tools at runtime. One API key replaces all of them.
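I can't vouch for Composio's exact SDK surface, so here is the meta-tool idea in deliberately generic TypeScript: a runtime registry plus search and execute tools instead of one hardcoded client per service (the real harness has a third meta-tool; a schema-inspection step would be the obvious candidate). Every name below is illustrative.

```typescript
// Generic sketch of the meta-tool pattern: built agents discover and call
// tools at runtime instead of shipping hardcoded integrations. The registry
// is an in-memory stand-in for a real catalog; nothing here is Composio's API.
type ToolFn = (args: Record<string, string>) => string;
const registry = new Map<string, ToolFn>();

// Meta-tool 1: find candidate tools for a task description.
function searchTools(query: string): string[] {
  return [...registry.keys()].filter((name) => name.includes(query));
}

// Meta-tool 2: execute a discovered tool by name.
function executeTool(name: string, args: Record<string, string>): string {
  const fn = registry.get(name);
  if (!fn) throw new Error(`unknown tool: ${name}`);
  return fn(args);
}

// The catalog grows without touching agent code.
registry.set("github.list_issues", (a) => `open issues in ${a.repo}`);
registry.set("reddit.search_posts", (a) => `posts matching ${a.q}`);
```

The payoff is that the agent describes what it needs in the prompt, and the tool layer resolves it at runtime.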

The Venture Score

Scoring is dead simple. 6 points, binary checklist:

Research quality (3 points):

  • SIGNAL — 3+ people asking for this solution online
  • GAP — no good free tool exists
  • FEASIBLE — solvable with 2-3 tools

Build quality (3 points):

  • INSTALLS — npm install succeeds without errors
  • RUNS — npm run dev starts the app
  • WORKS — a test prompt returns something useful

Build if research = 3/3. Ship if total >= 5/6.
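The checklist translates directly into code. A minimal sketch of the two thresholds (the interface is mine; the checks and cutoffs are from the post):

```typescript
// The 6-point venture score as described: six binary checks, build on
// perfect research, ship at 5/6 or better.
interface Checks {
  signal: boolean;   // 3+ people asking online
  gap: boolean;      // no good free tool exists
  feasible: boolean; // solvable with 2-3 tools
  installs: boolean; // npm install succeeds
  runs: boolean;     // npm run dev starts
  works: boolean;    // test prompt returns something useful
}

function decide(c: Checks): { total: number; build: boolean; ship: boolean } {
  const research = [c.signal, c.gap, c.feasible].filter(Boolean).length;
  const total = research + [c.installs, c.runs, c.works].filter(Boolean).length;
  return { total, build: research === 3, ship: total >= 5 };
}
```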

This is the same principle Karpathy applied with val_bpb — a single, measurable metric that the agent can evaluate without human judgment. No subjective quality scores. No "does it feel good." Binary checks. At 50+ repos, the scoring system proved itself — it filtered aggressively. For every repo that shipped, 2-3 ideas got rejected at the research stage. The score prevented the system from building garbage.

The Context Limit Problem

Claude Code hits context limits after 2-4 hours. That's the real bottleneck — not cost, not speed. Context window exhaustion kills the loop.

The fix is a 28-line bash script. When context fills up, the agent commits everything and writes a handoff note to research/next-session.md. The wrapper script restarts Claude Code, which reads the handoff and picks up where it left off.

Infinite sessions from finite context. The research compounds because every session reads what the previous sessions learned. This is bounded autonomy in practice — the agent operates within hard constraints but gets smarter over time. At 50+ repos, the restart wrapper ran dozens of times. Each restart was seamless — zero information loss.
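The author's wrapper is bash and isn't reproduced in the post, so here's the same idea sketched in Node/TypeScript: re-launch the CLI whenever it exits, seeding each session with the handoff file. The CLI name and flag are assumptions, not verified details.

```typescript
// Restart-wrapper sketch. The real script is 28 lines of bash; this is the
// same idea in Node. "claude" and "-p" are assumed CLI details.
import { spawnSync } from "node:child_process";
import { existsSync, readFileSync } from "node:fs";

const HANDOFF = "research/next-session.md";

// Build the prompt for a fresh session from the previous session's handoff.
function sessionPrompt(handoff: string): string {
  return handoff.trim()
    ? `Read program.md, then continue from this handoff:\n${handoff}`
    : "Read program.md and start fresh.";
}

function runForever(maxRestarts: number): void {
  for (let i = 0; i < maxRestarts; i++) {
    const note = existsSync(HANDOFF) ? readFileSync(HANDOFF, "utf8") : "";
    const result = spawnSync("claude", ["-p", sessionPrompt(note)], {
      stdio: "inherit",
    });
    if (result.error) break; // CLI not found: stop instead of spinning
  }
}
```

The key design choice is that the handoff lives on disk, not in context, so a dead session costs nothing.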


What Got Built

50+ repos across categories:

Developer tools — dependency analyzers, changelog summarizers, npm trust checkers, repo health scanners, code review assistants, migration planners.

Security — job scam detectors, phishing analyzers, CVE monitors, supply chain auditors.

Research — company briefing generators, competitor analysis agents, market gap finders, trend trackers.

Productivity — meeting prep agents, email drafters, documentation generators, onboarding assistants.

Each one is a self-contained Next.js app. Clone it, add your .env, run npm install && npm run dev, and you have a working AI agent in 60 seconds. Two API keys: one for the LLM (OpenRouter), one for API access (Composio).

Not all 50+ are production-grade. But every single one installs, runs, and responds to a test prompt. The venture score guaranteed that baseline. The quality distribution looks like a power law — a handful are genuinely useful, most are decent MVPs, and the tail end are narrow tools that solve real but small problems.

The 2-3 Tool Constraint

Every agent follows the same pattern:

  • Tool 1: GATHER — Get raw input from the world
  • Tool 2: PROCESS — Transform, analyze, or enrich
  • Tool 3: OUTPUT — Deliver the result

2 tools minimum, 3 maximum. If you need 4+, the problem is too broad. This constraint is borrowed directly from Karpathy's simplicity criterion — simpler beats clever at equal performance.

The constraint forces focus. A dependency analyzer doesn't need to fetch changelogs AND compare versions AND run compatibility checks AND generate migration guides. It fetches, summarizes, and flags urgency. Three tools. Done. This held at 50+ repos. The agent never needed to violate the constraint — it just got better at decomposing problems into the right 2-3 tools.
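The constraint can even be enforced by the type system. A hypothetical harness signature (the tool roles mirror the list above; the interfaces are mine):

```typescript
// Hypothetical typing of the 2-3 tool constraint.
interface Tool {
  name: string;
  role: "gather" | "process" | "output";
  run: (input: string) => string;
}

// A union of tuple types makes "2 minimum, 3 maximum" a compile-time rule:
// a 1-tool or 4-tool agent simply doesn't type-check.
type AgentTools = [Tool, Tool] | [Tool, Tool, Tool];

function runAgent(tools: AgentTools, input: string): string {
  const pipeline: Tool[] = tools; // widen the tuple for iteration
  // Tools run in order: gather, then process, then (optionally) output.
  return pipeline.reduce((acc, t) => t.run(acc), input);
}
```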

Research Compounds

This is the part that surprised me. The research log grew with every session. The agent read its own history before starting — it knew what it had tried, what failed, and what patterns worked. Every 5 builds, it wrote a meta-reflection analyzing its own performance.

By repo 20, the hit rate was noticeably higher. The agent stopped proposing tools that overlapped with existing ones. It started identifying niches instead of broad categories. It learned that developer tools scored higher than consumer tools. It learned that security-adjacent problems had bigger gaps than productivity problems.

Failed builds weren't wasted. The research behind them was documented — signals, gaps, existing solutions, user quotes. Even a failed build produced a knowledge base entry that made future research better.

Everything is on disk, everything is committed. Nothing lives only in the AI's context. cat results.tsv gives you the full picture in the morning.
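The post doesn't show the columns of results.tsv, so the layout below (name, score, status) is an assumption; the point is just that the morning review is a file read, not a context dump.

```typescript
// Hypothetical results.tsv summarizer; the column layout is assumed.
function summarizeResults(tsv: string): { total: number; shipped: number } {
  const rows = tsv
    .trim()
    .split("\n")
    .map((line) => line.split("\t"));
  return {
    total: rows.length,
    shipped: rows.filter((cols) => cols[2] === "shipped").length,
  };
}
```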

What This Actually Proves

This isn't about the 50+ repos. Those are outputs. The interesting part is the pattern:

  1. Loop + metric + "never stop" scales. The architecture didn't break at 50. It got better. The constraints did the work.
  2. Research compounds across sessions. The agent genuinely improved at problem selection over time. The meta-reflections weren't decoration — they were real feedback loops.
  3. 2-3 tools is the right constraint at any scale. Gather, Process, Output covered everything the system needed for 50+ different problems. More tools means more complexity means worse results.
  4. Context limits are the real bottleneck. Not cost, not speed. The restart wrapper ran dozens of times and never lost state. A model with infinite context wouldn't need it — but the workaround holds.
  5. API unification is a cheat code. Replacing hardcoded API integrations with meta-tools that discover and use any API at runtime was the key unlock. Without Composio, each agent would need custom integration code. With it, the agent just describes what it needs.

The whole thing runs on two API keys and a markdown file. Point Claude Code at program.md, go to sleep, wake up to repos.

The pattern is transferable. Karpathy built it for training runs. I applied it to agent building. Someone else could apply it to market research, security auditing, content creation — anything where you can define a score and a loop. The architecture doesn't change. Only the metric does. I just proved it works at 50x the original scale.


Frequently Asked Questions

How did Claude Code build 50+ agent repos overnight?

Using a loop pattern inspired by Karpathy's autoresearch: research pain points, score them on a 6-point venture score, fork a seed harness, build, validate, commit, and repeat. A restart wrapper handles context limits across sessions.

What is the venture score?

A 6-point binary checklist: 3 points for research quality (signal, gap, feasibility) and 3 for build quality (installs, runs, works). Build if research scores 3/3, ship if the total is 5/6 or higher.

Why does every agent use only 2-3 tools?

Every agent uses exactly 2-3 tools: Gather (get input), Process (analyze), Output (deliver the result). If you need 4+ tools, the problem is too broad. This constraint held across all 50+ repos without exception.

How does the research compound across sessions?

Each session reads the previous research log before starting. Every 5 builds, the agent writes a meta-reflection. By repo 20, hit rates had improved: the agent learned which problem categories score higher and avoided overlap.

What kinds of agents got built?

Developer tools (dependency analyzers, npm trust checkers), security agents (scam detectors, CVE monitors), research agents (company briefings, competitor analysis), and productivity tools (meeting prep, documentation generators).

How does the system handle context limits?

A 28-line bash wrapper script. When context fills up, the agent commits everything and writes a handoff note. The wrapper restarts Claude Code, which reads the handoff and continues. Zero information loss across dozens of restarts.

Let's connect.

I write about what I build, I build what I'm curious about, and I'm always up for a real conversation about systems, agents, or legacy architecture. If you have a problem that sounds like plumbing — reach out.

Email me