I was in the room at Code w/ Claude London on May 19. I wrote up the live takeaways on LinkedIn the same week. This post is the slower version: the workshop notes that survived the week, the eval discipline I'm wiring into my own production stack, and the awkward question I kept circling back to — which of the patterns I just learned will Claude 6 make obsolete?
That question has a name. Rich Sutton called it The Bitter Lesson in 2019. It's a two-page essay, and it's the most useful lens I've found for deciding what to invest in as an agent engineer in 2026.
This post is structured in three parts:
- Three workshop takeaways that stuck after a week
- The bigger pattern — eval-driven agent development
- The Bitter Lesson lens — what compounds, what gets eaten
The Event
Code w/ Claude is Anthropic's developer conference. The 2026 edition ran across three cities: San Francisco on May 6, London on May 19, and Tokyo on June 10. The London edition was organised around three main tracks: Research, Claude Platform, and Claude Code. Workshops, demos, breakouts, office hours with Anthropic engineers, and a Builder Stage running founder talks in parallel.
The opening keynote was led by Boris Cherny with the rest of the Anthropic platform team. Boris talked about his own builder roots before joining Anthropic, and that framing hit different sitting in the room as a software engineer who builds for a living. The whole event leans hard into "builders building" rather than "vendor presenting to prospects." That's the right shape for this stage of the agent ecosystem.
Three Takeaways That Stuck
Managed Agents — Session State as a Default
Isabella He's workshop on shipping your first Managed Agent was the first thing that made me rewrite a mental model.
The pattern I've been carrying: an agent has a context window, the context window evaporates between sessions, and I am responsible for persisting whatever should survive — task board, intermediate results, tool call history. That means a database. That means migrations. That means infrastructure I'd rather not own for a system that's still iterating weekly.
Managed Agents flip that. Anthropic positions managed agents as a developer-facing infrastructure layer for building and running AI agents, shifting workloads from isolated inference calls to coordinated, stateful workflows. Session state — messages, tool calls, results — is persisted server-side by Anthropic. On the client side, you hold a session ID. The full history replays via sessions.events.list().
What this changes in practice:
- No DB schema design for conversation state
- No backup/restore strategy for agent histories
- Full audit trail by default — every event is queryable
- The agent surface and the storage surface are the same thing
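To make "you hold a session ID, the history lives elsewhere" concrete, here is a minimal TypeScript sketch. Everything in it is an assumption for illustration: `SessionEvent`, `EventStore`, and `makeFakeStore` are hypothetical names, and the in-memory fake only stands in for the server-side listing call. It is not the Anthropic SDK, and the real event schema may look different.

```typescript
// Hypothetical event shape; the real Managed Agents event schema may differ.
type SessionEvent =
  | { kind: "message"; role: "user" | "assistant"; text: string }
  | { kind: "tool_call"; tool: string; input: string }
  | { kind: "tool_result"; tool: string; output: string };

// Hypothetical stand-in for whatever lists a session's events server-side.
// It only mirrors the idea of "session ID in, full history out".
interface EventStore {
  list(sessionId: string): SessionEvent[];
}

// In-memory fake so the sketch runs without network access.
function makeFakeStore(data: Record<string, SessionEvent[]>): EventStore {
  return { list: (sessionId) => data[sessionId] ?? [] };
}

// Rebuild a readable transcript from nothing but the session ID.
function replayTranscript(store: EventStore, sessionId: string): string[] {
  return store.list(sessionId).map((e) => {
    switch (e.kind) {
      case "message":
        return `${e.role}: ${e.text}`;
      case "tool_call":
        return `call ${e.tool}(${e.input})`;
      case "tool_result":
        return `result ${e.tool} -> ${e.output}`;
    }
  });
}
```

The point of the shape is that the client keeps no table of its own: the audit trail and the conversation state come from the same listing call.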
The trade-off Isabella was clear about: you're now coupled to Anthropic's persistence model. Across Claude Code, Claude Platform, Claude Managed Agents, MCP connectors, memory stores, routines, subagents, skills, managed settings, and telemetry, Anthropic is building more of the agent control plane around the model. If you're prototyping or iterating, that's a fair trade for the infrastructure you're not building. If you have hard requirements around state portability, you architect differently.
For me — building multi-tenant agent systems where the client-facing value is the agent's output, not its session ledger — the trade is fine. I'd rather spend that engineering time on tool design and evals.
Memory + Dreaming — A Filesystem and a Curator
Kevin Chan's workshop on Memory + Dreaming was the one I keep coming back to.
The mental model I walked in with: "memory" means stuffing relevant context into the prompt at the start of a session. RAG, basically, with a glossier name. Maybe a single memory.md file the agent reads on boot.
That's wrong on both counts. Anthropic's memory tool stores information in a memory directory (/memory) that persists between sessions, and the agent can create, read, update, and delete files in it.
A memory store is a filesystem, not a file. Each entry has its own path (/projects/foo/notes.md), its own size budget, and its own edit history. The agent doesn't load the whole tree on boot — it greps for the keyword it needs, finds the right file, then reads only the matching chunk. Same mental model as Claude Code working on a local repo. Context window stays small even when the store gets big.
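The grep-then-read access pattern can be sketched in a few lines of TypeScript. This is a toy under stated assumptions: `MemoryStore` is a hypothetical in-memory map from path to file contents, and `grep` and `readChunk` are illustrative helpers, not the memory tool's actual commands.

```typescript
// Hypothetical in-memory stand-in for a memory store: path -> file contents.
type MemoryStore = Record<string, string>;

interface Hit {
  path: string;
  line: number; // 1-based line number of the match
  text: string;
}

// Step 1: search the tree for a keyword instead of loading every file.
function grep(store: MemoryStore, keyword: string): Hit[] {
  const needle = keyword.toLowerCase();
  const hits: Hit[] = [];
  for (const path of Object.keys(store).sort()) {
    (store[path] ?? "").split("\n").forEach((text, i) => {
      if (text.toLowerCase().includes(needle)) {
        hits.push({ path, line: i + 1, text });
      }
    });
  }
  return hits;
}

// Step 2: read only a small window of lines around the hit.
function readChunk(store: MemoryStore, hit: Hit, radius = 1): string {
  const lines = (store[hit.path] ?? "").split("\n");
  const start = Math.max(0, hit.line - 1 - radius);
  const end = Math.min(lines.length, hit.line + radius);
  return lines.slice(start, end).join("\n");
}
```

However large the store grows, what reaches the context window is bounded by the number of hits and the chunk radius, not by the size of the tree.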
The second move is "Dreaming." This is an async curator subagent. While the foreground agent is doing work — or between sessions — a separate process walks the memory tree and reorganises it: consolidates duplicates, marks stale entries, enriches related facts, restructures the directory layout. The output isn't an in-place edit. It's a new memory store that you swap in on the next session.
The cleanest framing Kevin gave: memory isn't where the agent stores answers. It's where the agent stores lessons — and dreaming is the system that turns raw episodic notes into a curated knowledge base, on its own, on a cron.
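A rough TypeScript sketch of the curator idea, under loud assumptions: `MemoryEntry`, `dream`, the day-number timestamps, and the staleness rule are all hypothetical, and the sketch covers only two of the jobs described above (consolidating duplicates and marking stale entries). What it does preserve is the key property from the workshop: the input is never edited in place, and the result is a new store you swap in.

```typescript
// Hypothetical entry shape; real memory files carry no such metadata by default.
interface MemoryEntry {
  path: string;
  body: string;
  lastUsedDay: number; // illustrative: day counter of last access
  stale?: boolean;
}

// Produce a NEW store: keep the most recently used copy of each duplicate
// (same body after trimming and lowercasing) and flag entries not used recently.
function dream(
  entries: readonly MemoryEntry[],
  today: number,
  staleAfterDays: number,
): MemoryEntry[] {
  const newest = new Map<string, MemoryEntry>();
  for (const e of entries) {
    const key = e.body.trim().toLowerCase();
    const kept = newest.get(key);
    if (kept === undefined || e.lastUsedDay > kept.lastUsedDay) {
      newest.set(key, e);
    }
  }
  return Array.from(newest.values())
    .map((e) => ({ ...e, stale: today - e.lastUsedDay > staleAfterDays }))
    .sort((a, b) => a.path.localeCompare(b.path));
}
```

Because `dream` returns copies, the foreground agent can keep reading the old store while the curator runs, and the swap happens at the next session boundary.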
The customer signal here is strong. Rakuten's long-running task agents use memory to avoid repeating past mistakes, reporting 97 percent fewer first-pass errors within workspace-scoped, observable boundaries. That's the order of magnitude that makes this worth engineering for, not against.
Agent Decomposition — The Hard Part Is The Interface
William Steuk's workshop on agent decomposition was deceptively simple on the slides and not at all simple in practice.
The pattern: when your orchestrator's tool catalog gets too large for the model to reason over cleanly — usually somewhere north of 15-20 tools, depending on overlap — you decompose. Group related tools into focused subagents with narrow toolsets. The orchestrator calls subagents instead of tools directly. Each subagent is small, focused, and has a tool surface it can actually navigate.
That part is obvious once you've felt it. The non-obvious part is what William flagged as the actual hard problem:
The hard part isn't picking the split. It's defining the interface between layers.
Once you have an orchestrator + subagents, you have a protocol. What does the orchestrator pass in? What does the subagent return? What format? What invariants does the orchestrator rely on? What happens when the subagent partially fails? If the messages between layers are ambiguous or lossy, the whole system degrades — and worse, the failure mode is hard to localise because it's distributed across the interface boundary.
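One way to force those questions into the open is to write the contract down as types before writing any prompts. The TypeScript below is a sketch, not a prescribed protocol: `BriefRequest`, `BriefResult`, and `validateResult` are hypothetical names for an imagined content-brief subagent, and the invariants checked are examples of the kind an orchestrator might rely on.

```typescript
// Hypothetical contract between an orchestrator and a content-brief subagent.
interface BriefRequest {
  requestId: string;
  targetKeyword: string;
  maxWords: number;
}

// Partial failure is a first-class outcome, not an exception path.
type BriefResult =
  | { status: "ok"; requestId: string; brief: string }
  | { status: "partial"; requestId: string; brief: string; missing: string[] }
  | { status: "failed"; requestId: string; reason: string };

// Check the invariants at the boundary so failures are localised there.
// Returns a list of violations; an empty list means the result is acceptable.
function validateResult(req: BriefRequest, res: BriefResult): string[] {
  const problems: string[] = [];
  if (res.requestId !== req.requestId) {
    problems.push("requestId mismatch");
  }
  if (res.status !== "failed") {
    const words = res.brief.split(/\s+/).filter((w) => w.length > 0).length;
    if (words === 0) problems.push("empty brief");
    if (words > req.maxWords) problems.push("brief exceeds maxWords");
  }
  if (res.status === "partial" && res.missing.length === 0) {
    problems.push("partial result lists nothing missing");
  }
  return problems;
}
```

The discriminated union makes the orchestrator handle all three outcomes explicitly, and the validator gives a single place to look when the system degrades.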
This is the same lesson distributed systems people learned in the 2010s: service decomposition is mostly an interface design problem dressed up as an architecture problem. Agent decomposition inherits the same lesson — and the people who've shipped microservice systems have a real advantage here that the prompt-engineering generation doesn't.
I'm using this as a forcing function on my own multi-tenant SEO agent. Right now it has 13 in-process tools on a single orchestrator. Some of those naturally cluster: the GSC tools, the paid-channel audit tools, the content brief tools, the Webflow publishing tools. The split is easy. The interface contract between an orchestrator and a "content brief subagent" is the work I haven't done yet.
The Bigger Pattern: Hill Climbing on the Eval
The single most-quoted line from London, in my notes:
Hill climbing on the eval — climb on them.
— William Steuk (paraphrased)
This is the pattern that connected every other workshop. It came up in the decomposition session, it came up in the managed agents session, and it was the entire content of the eval-driven agent development workshop.
The frame: your eval suite is a landscape. Each prompt change, model swap, tool tweak, or architectural decision is a step. You don't solve agent quality in one shot. You pick the next step that moves the score upward, and you repeat. The eval gives you the gradient. Without it, you're not engineering — you're vibing with extra steps.
A few things from the eval-driven workshop that survived the week:
LLMs are bad at defining their own evals. Don't ask the model to write its own rubric. It will under-specify, hand-wave, and grade leniently on its own output. The model can apply a rubric (LLM-as-judge is fine, with caveats). It cannot author one that's worth optimising against. The rubric has to come from a human who has taste in the domain.
Defining "good" is the actual work. The line that landed for me was something like: "sometimes you watch a movie and you know it's bad, but you can't articulate why." The gap between quality felt and quality articulated is the entire job of writing an eval rubric. The score won't be useful until someone in the room can name the failure modes in language. That's not engineering. That's knowledge elicitation. Most engineers find it uncomfortable because it isn't code.
Where to source rubrics when you're not the expert. Scrape places where domain experts already articulate quality in language — subreddit critiques, product reviews, expert blogs, code review threads, design critiques. They've already done the work of converting tacit quality into criteria. You can mine that to seed your judge prompts. This was the single most actionable tip from the workshop.
Two kinds of graders. Split your evals into:
- Code graders: deterministic, no LLM calls, instant, free. For my SEO agent: did it produce an output? Is the word count in range? Did it use the target keyword in the H1? Is the internal link count correct? These are TypeScript over parsed output. They run on every PR.
- Judge graders: LLM-as-judge calls, structured 0–5 scores via schema. For my SEO agent: does the intro hook? Is the tone consistent with the brand voice examples? Does the H1 match the brief intent? These need a model that can read.
Both go into one scorecard. Heat-mapped red/yellow/green. Deltas versus a saved baseline. You look at the scorecard before and after every change. That's it. That's the loop.
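A minimal TypeScript sketch of that loop, assuming a simplified parsed output. The names (`Case`, `codeGraders`, `buildScorecard`, `heat`, `deltas`), the word-count bounds, and the colour thresholds are all illustrative choices, and judge scores are passed in as plain numbers because the LLM call itself is out of scope here.

```typescript
interface Output {
  h1: string;
  body: string;
}
interface Case {
  id: string;
  targetKeyword: string;
  output: Output;
}

// Code graders: deterministic, no LLM call. Each returns pass or fail.
type CodeGrader = (c: Case) => boolean;

const codeGraders: Record<string, CodeGrader> = {
  hasOutput: (c) => c.output.body.trim().length > 0,
  keywordInH1: (c) =>
    c.output.h1.toLowerCase().includes(c.targetKeyword.toLowerCase()),
  wordCountInRange: (c) => {
    const n = c.output.body.split(/\s+/).filter((w) => w.length > 0).length;
    return n >= 5 && n <= 50; // illustrative bounds
  },
};

// One scorecard: every dimension normalised to the range 0..1.
type Scorecard = Record<string, number>;

function buildScorecard(
  cases: Case[],
  judgeScores: Record<string, number>, // 0 to 5 per judge dimension
): Scorecard {
  const card: Scorecard = {};
  for (const name of Object.keys(codeGraders)) {
    const passed = cases.filter((c) => codeGraders[name](c)).length;
    card[name] = cases.length === 0 ? 0 : passed / cases.length;
  }
  for (const dim of Object.keys(judgeScores)) {
    card[`judge:${dim}`] = judgeScores[dim] / 5;
  }
  return card;
}

type Heat = "green" | "yellow" | "red";

// Illustrative thresholds for the heat map.
function heat(score: number): Heat {
  return score >= 0.8 ? "green" : score >= 0.5 ? "yellow" : "red";
}

// Change in each dimension versus a saved baseline (missing baseline counts as 0).
function deltas(current: Scorecard, baseline: Scorecard): Record<string, number> {
  const out: Record<string, number> = {};
  for (const k of Object.keys(current)) {
    out[k] = current[k] - (baseline[k] ?? 0);
  }
  return out;
}
```

In practice the baseline would be a JSON file checked into the repo, so a PR that turns a green cell yellow is visible in review.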
You can run multiple judge dimensions in one batched vision call. A small efficiency note that matters at scale: don't fire one LLM-as-judge call per dimension. Memoize a single batched call that returns all dimensions at once. Four judges, one batch.
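A sketch of the batching-plus-memoization idea in TypeScript. The `BatchedJudge` type is a hypothetical stand-in for a single structured-output judge request, the four dimension names are made up, and the sketch is kept synchronous for readability; a real judge call would be asynchronous, in which case you would cache the promise instead of the result.

```typescript
type Dimension = "hook" | "tone" | "intent" | "clarity";
type Scores = Record<Dimension, number>;

// Hypothetical judge: ONE call that returns a score for every dimension.
type BatchedJudge = (output: string, dimensions: Dimension[]) => Scores;

// Wrap the judge so each distinct output is only ever graded once.
function memoizeJudge(
  judge: BatchedJudge,
  dimensions: Dimension[],
): (output: string) => Scores {
  const cache = new Map<string, Scores>();
  return (output) => {
    const cached = cache.get(output);
    if (cached !== undefined) return cached;
    const scores = judge(output, dimensions);
    cache.set(output, scores);
    return scores;
  };
}
```

With this shape, rendering the scorecard four times for four dimensions still costs one judge call per output.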
Optional next step: close the loop. The workshop's reference setup is one-shot: agent generates, graders grade, scorecard prints, human iterates. You can wire grader scores back to the agent (feed the failing rows into a second session and ask it to retry against them). The reference doesn't do this, but it's the natural next step once your eval set is mature enough that automated iteration won't get stuck on a local optimum.
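If you do close the loop, a bounded retry is the simplest version. The TypeScript below is a sketch under assumptions: `generate` and `grade` are hypothetical injected functions standing in for an agent session and the grader suite, and the workshop reference does not include anything like this.

```typescript
interface Attempt {
  output: string;
  failures: string[]; // names of the checks that failed
}

// Generate, grade, and feed the failing checks into the next round.
// Stops early on a clean pass and never exceeds maxRounds.
function iterate(
  generate: (feedback: string[]) => string,
  grade: (output: string) => string[],
  maxRounds: number,
): Attempt {
  let feedback: string[] = [];
  let best: Attempt | undefined;
  for (let round = 0; round < maxRounds; round++) {
    const output = generate(feedback);
    const failures = grade(output);
    // Keep the attempt with the fewest failures seen so far.
    if (best === undefined || failures.length < best.failures.length) {
      best = { output, failures };
    }
    if (failures.length === 0) break;
    feedback = failures;
  }
  if (best === undefined) {
    throw new Error("maxRounds must be at least 1");
  }
  return best;
}
```

The round cap and the keep-the-best rule are there so a noisy grader cannot make the result worse than the first draft.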
I left London with this as my single biggest debt. I don't have proper evals on my multi-tenant SEO agent. I have intuition built from shipping it for months, and intuition is fine until it isn't. Without evals, I can't tell you if swapping models made things better or worse. I can't tell you if a prompt tweak is actually an improvement or just a different vibe. I'm flying by feel on a system that runs weekly across an agency's client roster, and the moment I admit that out loud it becomes obvious that it has to change this quarter.
The Bitter Lesson Lens
Now the awkward part.
The Bitter Lesson is two pages. Rich Sutton wrote it in 2019. The thesis is simple and uncomfortable:
The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.
The pattern Sutton documents: every generation of AI researchers tries to be clever by encoding their understanding of how the problem "should" work — handcrafted chess heuristics, phoneme-based speech models, SIFT features for vision. And every generation gets beaten, eventually, by simpler methods that scale compute through search and learning. The cleverness plateaus. The compute keeps scaling. The bitter part is that researchers want their cleverness to matter, and it doesn't.
Translate this to agent building in 2026:
The harness keeps shrinking because the model keeps absorbing it. Look at the last four years:
- 2022: prompt engineering as a craft — magic phrases, role-playing, "let's think step by step." Mostly gone. Models reason natively now.
- 2023: hand-crafted output parsers, regex coaxing JSON out of completions. Gone. Structured outputs and function calling are native.
- 2024: elaborate RAG pipelines with bespoke chunking and reranking. Shrinking. Long context plus native retrieval is eating most of it.
- 2025: hand-managed memory, custom multi-agent orchestration, hand-built verification loops. Currently shrinking — managed agents, memory + dreaming, native multi-step tool use are exactly the Bitter Lesson playing out in real time.
Every layer of human cleverness in the harness eventually gets eaten. That's not a failure of harness work. It was useful while it was needed. It's just the trajectory.
So what do you actually invest in?
What compounds across model generations (build deeply):
- Evals and rubric design. Models get better, but the question "did my system improve on my domain?" never goes away. The rubric is the part that survives every model swap.
- General-purpose tools. A bash tool plus code execution plus web access can do almost anything. The model uses them better as it gets smarter. Fifteen narrow specialised tools is harness work that gets absorbed.
- Production substrate. Sandboxes, capability boundaries, audit trails, distributed locks, identity, permissions, observability. These aren't cognition. They're systems engineering. They don't get eaten because they're not what the model does — they're what surrounds it in production.
- Domain knowledge and integration surfaces. Knowing what a specific CRM actually does in production, what a Webflow CMS will and won't accept, what your client's industry values — the model doesn't have this for your specific verticals. Stays valuable.
- Taste and problem selection. The skill of looking at a workflow and knowing which parts to automate, which to keep human, where the leverage actually is. The most durable thing on the list.
What gets eaten (use, but architect to throw away):
- Most prompt engineering beyond a clear, minimal system prompt
- Hand-crafted memory schemes (Memory + Dreaming is the writing on the wall)
- Bespoke multi-agent orchestration patterns (managed agents are coming for this)
- Custom verification harnesses for things models will eventually self-verify
- Narrow specialised tools where one general tool would do
The one-line operational version: encode your taste as evals, not as rules. Rules constrain the model. Rules get eaten when the model gets smarter. Evals let the model improve against the thing you actually care about. Evals compound.
This is the move that connects Sutton's 2019 essay to William's "hill climb on the eval" in 2026. Evals aren't just measurement infrastructure. They're the durable scaffolding around an increasingly capable model. You're not building cleverness anymore. You're building the gradient signal that lets the cleverness happen on its own.
What I'm Changing In My Own Stack
Three concrete things from London that are landing in my work over the next month:
- Real evals for the multi-tenant SEO agent. Start embarrassingly small: pick 10–20 inputs by hand, define what "good" output looks like for each, run it before and after every meaningful change. Layer in code graders and judge graders as the rubric matures. Stop hill climbing on vibes.
- Audit the 13 in-process tools. For each one, ask: is this here because it's load-bearing in production (deterministic safety, idempotency, audit trail), or because today's models aren't quite reliable enough to be trusted with a more general primitive? Anything in the second bucket is on a 12–18 month clock.
- Define interface contracts before splitting the orchestrator. If I decompose the SEO agent into a content-brief subagent, an audit subagent, and a publishing subagent, the protocol between them is where the work is. Designing that contract is more important than picking which tools go where.
Closing
The Bitter Lesson is uncomfortable because it tells you that most of the clever harness work you're doing right now will be obsolete in 18 months. It is also liberating because it tells you exactly where to spend your time: on the parts that compound across model generations. Evals. Substrate. Domain knowledge. Taste.
Code w/ Claude London was the first event where I saw the whole industry quietly aligning around this. Managed Agents is Sutton's lesson applied to session state. Memory + Dreaming is Sutton's lesson applied to long-running context. Eval-driven development is Sutton's lesson applied to how you actually iterate.
The takeaway isn't "use the new Anthropic features." The takeaway is: build the discipline that survives every iteration of the features.
Thanks to Anthropic for the invite, and to Boris, Isabella, Kevin, and William for sessions worth taking notes during.
If you were also at London or you're working on agent harnesses in production, I'd be curious how you're handling the eval discipline question — drop me a line.
Further Reading
- The Bitter Lesson — Rich Sutton, 2019. Two pages. Read it tonight.
- Code w/ Claude — the conference itself, recordings going up on Anthropic's YouTube channel.
- Anthropic Managed Agents — the platform layer the London workshops were built on top of.
- My LinkedIn recap of Code w/ Claude London — the shorter live version of these notes.
