Code w/ Claude London 2026: Managed Agents, Memory + Dreaming, Evals

Code w/ Claude London, my conference lanyard

I was at Code w/ Claude in London on May 19, Anthropic's dev conference. The thing that stuck with me most was Boris Cherny's keynote. Not some announcement, his story. He started out as a builder too, small team, a project they almost gave up on, and today it's one of the most important products at Anthropic. Hearing that from someone who sat exactly where you're sitting now, that lands different. Here are the three things still stuck with me a week later, plus one lens I picked up afterwards.

The main stage at Code w/ Claude London during the keynote

Managed Agents: session state handled for you

I already knew Managed Agents, and the infra behind them. My instinct has always been: I'd rather build it myself. Isabella He's workshop was a clean reminder of what the thing actually takes off your plate.

The usual way: an agent has a context window, it dies between sessions, so you persist everything yourself. Task board, intermediate results, tool call history. That means a database, migrations, infra you maintain for something that changes every week.

Managed Agents flip that. The session state (messages, tool calls, results) lives server-side at Anthropic, I just hold a session ID, and the history replays through sessions.events.list(). No schema design, no backups, audit trail by default.

Isabella He presenting Ship your first Managed Agent

I'm honestly split on this. On one side, great, a huge pile of infra work just disappears. On the other, you hand over a lot of control. What if I want my memory to work completely differently, or smarter workflows running inside it? And Anthropic's models aren't exactly cheap, so the running cost is a real factor. For what I build the trade mostly works, but I'm going in with my eyes open.

Memory + Dreaming: a filesystem and a curator

The workshop I keep coming back to.

I had Karpathy's LLM wiki in my head walking in, basically Obsidian for agents. An agent reads raw transcripts, pulls out the concepts, writes clean linked markdown. Knowledge builds up instead of getting re-derived from RAG chunks on every query. That's exactly what Anthropic showed, just as a production version, in two parts.

Part one: the store is a filesystem. Everything lives in a /memory directory that persists between sessions. The agent greps for the keyword it needs, finds the file, reads just that chunk. Same shape as Claude Code in a repo. The context window stays small even as the store grows.

The Dreaming slide, step 4 of the memory workshop

Part two, and this is the move that makes it click: Dreaming. An async subagent that walks the memory tree between sessions and cleans it up. Merges duplicates, drops stale stuff, restructures the layout. No in-place edit, it builds a fresh store you swap in next time. That's Karpathy's raw to wiki step exactly, just managed instead of running locally against your own Obsidian vault.

Kevin's line that stuck: memory isn't where the agent stores answers, it's where it stores lessons.

Agent Decomposition: the hard part is the interface

William Steuk on stage with the StockPilot orchestrator slide

Sounds simple: when your orchestrator has too many tools (somewhere past 15 to 20) you group related ones into subagents, and the orchestrator calls subagents instead of tools. Fine so far.

William's actual point:

The hard part isn't picking the split. It's defining the interface between the layers.

Once you have an orchestrator plus subagents, you have a protocol. What goes in, what comes back, in what format, what happens when a subagent half-fails. If that's vague the whole system degrades and you can barely find the bug, because it's smeared across the boundary. Same lesson as microservices in the 2010s, just for agents.

Newsletter

Weekly insights on AI Architecture. No spam.

My SEO agent has 13 tools on one orchestrator right now and runs fine. But I can see the clusters already (GSC, paid audit, content briefs, Webflow publishing). The split is easy, the contract between them is the work. It runs clean today, so I'll get that pattern in before I need it, not after it breaks.

The bigger pattern: hill climbing on the eval

The line that hit hardest:

Hill climb on the eval.

William Steuk (paraphrased)

This was the thread through every workshop. Your eval suite is a landscape, every change is a step, the eval gives you the gradient. Without it you're not engineering, you're just vibing.

The thing that really landed: LLMs can't write their own evals. Apply one, sure (LLM-as-judge works), but the rubric has to come from a human with taste. And then the line: sometimes you look at something, you know it's bad, but you can't say why. That gap is the whole job.

That was my bridge. I'd had evals before in pieces, but never the "ah, this is how you write a good one" moment. The job isn't the grading code, that's easy. The job is putting the why into words. Once you can say why something is bad, the eval almost writes itself.

In practice it splits in two. Code graders: deterministic, free, run on every PR (is there output, is the word count in range, is the keyword in the H1). Judge graders: LLM-as-judge with 0 to 5 scores (does the intro hook, is the tone on brand). Both into one scorecard, red yellow green, deltas against a baseline, you look before and after every change. And batch the judge calls, one for all dimensions instead of one each.

I left London knowing this is my biggest gap. Not "I have no evals", more "I never knew how to write good ones." Now I do, and it's going into the SEO agent this quarter.

The lens I picked up after: The Bitter Lesson

A speaker mentioned Rich Sutton's The Bitter Lesson in passing, a "go read this" aside. So I did, on the way home. Two pages from 2019.

Short version: across 70 years of AI, the general methods that ride compute beat the clever handcrafted ones. Every time. Applied to harness engineering, my takeaway wasn't "stop building the harness layer", it was: watch how much you engineer, don't build something the next model just absorbs.

You can see it happening. 2022 prompt-engineering magic, gone. 2023 hand-rolled JSON parsers, gone. 2024 fat RAG pipelines, shrinking. 2025 hand-managed memory and custom orchestration, and that's exactly what's shrinking now with Managed Agents and Memory + Dreaming.

So the question: what do I build deeply, what's throwaway? Deep: evals, general-purpose tools, production substrate (sandboxes, permissions, audit trails), domain knowledge, taste. Throwaway: most prompt engineering, hand-rolled memory schemes, custom orchestration, narrow tools where a general one would do. Short form: encode your taste as evals, not as rules. Rules get eaten when the model gets smarter, evals compound.

What I'm changing

Three things for the next month.

Real evals for the SEO agent. Start small, ten to twenty inputs by hand, define what good looks like, run before and after every change. Stop optimising on vibes.

Audit the 13 tools. Is each one genuinely load-bearing, or just there because the models aren't reliable enough yet? The second kind is on a clock.

Design the decomposition contract before I split. The protocol between subagents is the real work. It runs fine today, so I've got time to do it right.

If you were in London too, or you work on agent harnesses, tell me how you handle the eval question.