JoT

The Spam Economy Comes to Work

Jeff — Mon, 01 Jun 2026 05:49:50 GMT

Everyone in your company has an AI model. Productivity isn’t going up, but there’s a load of documents flying around that imply it should be (many of those written by AI). And for each of those documents there’s a load-bearing triplicate of reasons on how those documents could be improved.

You’ve been AI-DDOSED.

ChatGPT says this conveys the spirit of this article.

Why does it happen?

There are plenty of reasons why people feel the need to spam AI gibberish. Speaking to friends and family shows me this is happening everywhere.

For some, the audience matters more than the work (see social media). Some of these are about looking good and some about not looking responsible if things break.

People like to karma-farm. An AI can turn any gibberish into something important and profound-sounding.[1]
People want to help (and seem to forget you’ve got a subscription too).
People want plausible deniability. “The AI suggested this approach” puts the blame elsewhere.
People confuse output with progress. Generating a robust strategy sure beats doing any work.

For others there’s genuine confusion about the tool. They see it as an all-seeing genius, trust it to do things it can’t, and take its agreeable output as evidence.

People mistake fluency for thinking. AI is an expert in everything you are not.
People want validation. The model will tell you it’s a good draft. People will not.
People genuinely believe it helped. The first 80% takes seconds, the remaining 20% takes forever.

There’s also the additive fallacy (or more bluntly, “more is more”).

People reward effort by length. A multi-page detailed analysis sounds better than “no”.
People are anchored to “comprehensive.” The AI offers seven considerations and guess what? You can always ask it for more.
People focus on getting their work done, without considering the impact on others.
People externalise the cost. Producing the document took five minutes. Reading it costs the company forty.
People can’t tell what’s signal (or don’t want to make the effort). AI might generate something with some insight, but it’s buried in noise.

In some places, AI is an expected default. It’s not just about keeping pace with your peers, it’s about embracing a fundamental shift in how we approach communication, collaboration, and the very nature of professional discourse itself.

People are paying for it. The company bought five seats per person and three pilot programmes that never ended. Something must visibly be happening, or the budget vanishes at the next review.
People imitate their peers. The VP started sending AI-generated weekly updates. Now everyone does.

What to do instead?

Framing this as a systems problem is easy. It’s a textbook externality, identical to the email spam economy. Sending costs nothing, receiving costs the world. Every defence we have (sender reputation, rate limits, filters) exists to claw back some of that asymmetry.

The fix is always the same. Internalise the externality. Make producers pay the reading cost, cap inventory on shared channels, make consumption visible (etc). All require discipline, which means they’ll fail. Try an incentive system instead and Goodhart eats you alive: measure seats, prompts run, or documents produced, and you’ll get more seats, prompts and documents.

Do you actually have a solution?

Nope.

Remember when digital camcorders were all the rage, and you had to suffer watching someone else’s badly recorded holiday movies? YouTube and Instagram came along and built filtering and reputation on top. We need that for office documents, at least until the likes themselves become the goal and the platform falls apart.

In the meantime, the best defence is asking people about the work “they’ve” produced.

It needs to be socially unacceptable to ship slop.[2]

[1] Daniel Dennett calls these “deepities” (https://philosophytalk.org/blog/deepities-and-bullshit/)

[2] I used to have sympathy. I’d see LLM use and assume it was people using an LLM to feel more confident in their writing. I was wrong. Maybe a subset of folks do this, but it’s dwarfed by the amount of slop produced.

What did I learn at Code w/Claude London?

Jeff — Mon, 25 May 2026 06:21:50 GMT

I was lucky enough to attend Code w/ Claude in London. It didn’t go hugely deep, I’m afraid, so I’ve mostly got tidbits of information that I found useful. So what did I learn?

Every post needs an image.

We’re on an exponential

I think almost every speaker from Anthropic said “we’re on an exponential”. Models are improving at an exponential rate. Unattended task length is improving at an exponential rate. And so on. This was often paired with “AI adoption is linear”. Put the two together and you’ve got an implementation gap, and that’s where we (armed with Claude Code) come into play.

Where does that exponential take us? Well, we can expect better judgement, near infinite context windows and multi-agent co-ordination. We’ve gone from completion to agentic execution. The next step will be continuous, proactive AI with always-on agents. Prompts will move from descriptions of what to do, to the outcomes we need to achieve (as general as “keep the project on track”).

Designing apps

Three guiding policies.

Build for the next model - as models get smarter, previous generations of scaffolding often hold them back. The direction of travel is more general tools, rather than highly specialized ones.
Curate frontier evals - you need to understand how your system performs with models (this was a common theme).
Treat model upgrades as a feature

What model should you use?

You should use the model that performs best on your eval. You don’t have evals? Use the defaults for the best results.
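To make "performs best on your eval" concrete, here is a minimal sketch of what I mean. Everything in it is an assumption for illustration: the "models" are stand-in functions rather than real API calls, and an eval case is just a prompt plus a pass/fail check.

```python
from typing import Callable

# Hypothetical eval case: a prompt and a checker that accepts or rejects the answer.
EvalCase = tuple[str, Callable[[str], bool]]


def score(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of eval cases the model passes."""
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)


def pick_best_model(models: dict[str, Callable[[str], str]], cases: list[EvalCase]) -> str:
    """Return the name of the model with the highest pass rate."""
    return max(models, key=lambda name: score(models[name], cases))


# Stand-in "models" so the sketch runs without any API access.
models = {
    "model-a": lambda prompt: prompt.upper(),
    "model-b": lambda prompt: prompt[::-1],
}
cases = [
    ("abc", lambda out: out == "ABC"),
    ("level", lambda out: out == "LEVEL"),
]

best = pick_best_model(models, cases)
print(best)
```

In a real setup the lambdas would be replaced by calls to whichever models you are comparing, and the checks by whatever "good" means for your system.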

The /advisor mode was recommended as a way to achieve near-frontier performance at a fifth of the cost. It uses the Sonnet model by default and escalates to Opus when needed.

Some Claude Code Tips and Tricks

/remote-control - Control Claude Code sessions from your phone so you don’t have to walk around with your laptop open.

/tui fullscreen - Flicker-free rendering (rendering is virtualized, no more horrible flicker)

claude agents - Manage a team of agents. Anything you type spawns off an agent and you can jump in/out of them anytime. Very nice!

/routine - Run remote agents to respond to events (or just a cron job). For example, sweep through the tech debt backlog, update code-reviews and so on.

/ultrareview - Launch a multi-agent review of your pull request. Used by Anthropic on every PR.

claude --worktree - Claude can use git worktrees so multiple agents can work in parallel.

Some new things

Alas, there was no new model for us to play with.

Self-hosted sandboxes do what they say on the tin! I can see this being super useful for bringing agentic workflows in house when you’re worried about your data leaving your infrastructure.

MCP tunnels solve the problem of securely connecting Claude to MCP servers running in your network (without opening your MCP server to the internet!).

How do you run an AI-native org?

By far my favourite talk because some of the opinions were the opposite of what I thought.

The bottlenecks have moved. The old bottlenecks were code, tests and refactoring. AI can do that now. The new bottlenecks are verification, review, security and XFN (I’m using the acronym to appear hip, I believe it’s x-functional).

So far, so standard. But what I found interesting was how the norms are changing at Anthropic.

There’s less planning. You might think that the emphasis should be there (after all, giving AI a good spec feels like it should be a good idea).
Onboarding is easier. The cost of answering questions is zero; AI can answer questions about code.
Technical debates become show don’t tell. No more arguing about design A vs. design B when you can just implement. In technical debates, code wins.
Fewer design doc rituals. No theatre of writing down X, just build it instead.
The source of truth is the codebase. All documentation in the codebase.

Code review is the problem we all thought it would be. Anthropic use Claude for style, obvious bugs, spec drift and so on, but humans where there are legal concerns, high risk, trust boundaries or questions of product taste.

What does the team make-up of the future look like? A combination of creative builders with product sense, and others with deep system expertise for the hard parts. All managers at Anthropic are ICs first, contributing code. And everyone dog-foods whatever they are building.

How’d you roll this out?

Claude-ify all your workflows - this frees up humans for the harder stuff
Tell people they can delete processes

You’ll be winning if onboarding is quicker, PR cycle time is faster and the number of Claude-assisted commits is going up.

Ranganathan’s Five Laws Apply to Your AI

Jeff — Mon, 18 May 2026 05:45:52 GMT

Your AI agent is only as good as its library.

The library is the (often messy) collection of prompts, documents, tools, memories, policies, APIs, search results, text files and previous conversations from which the harness tries to assemble enough understanding to act.

S. R. Ranganathan saw this shape of problem (how to organize information) nearly a century ago. In 1931, he published The Five Laws of Library Science:

Books are for use.
Every reader his book.
Every book its reader.
Save the time of the reader.
The library is a growing organism.

They sound like aphorisms, but they are closer to a design specification for information access. They also describe the design challenges of modern AI context stacks surprisingly well.

Books are for use

The book version: a library is not good because it owns books. It is good because readers can use them.

The context version: an AI system is not good because it has capabilities like RAG, crafted prompts and specific tools. Those things only matter if they improve the work the model does.

A retrieval system that returns seventeen vaguely relevant chunks may score well when tested individually and still make the model worse. A memory system that surfaces every past conversation about a customer may technically succeed while burying the one fact that matters. A tool description may be accurate and still encourage the model to call it at the wrong moment.

The question is not “did the context exist?” but “did the agent do better work because of it?”.

Every reader his book

The book version: different readers need different books. A physicist, a journalist and a schoolchild asking about quantum mechanics should not all be sent to the same shelf.

The context version: different models, users and workflows need different context.

A coding assistant, support bot, sales copilot, security analyst and compliance reviewer should not receive the same context package. They are doing different jobs, and have different risk profiles.

A coding assistant needs interfaces, invariants and recent changes. A support bot needs entitlement, known issues and escalation rules. A compliance reviewer needs policy, exceptions and an audit trail.

We often expect one agentic system to serve multiple needs and rely on the model to do the filtering. We give them the same pile of books (the same tool list, the same memory strategy, the same retrieval and so on).

That works while the library is tiny. It fails once the system grows.

An AI system with access to twenty MCP servers does not need twenty server descriptions on every task. It needs routing. Which tool is relevant? Which source of truth applies? Which memory matters? Which policy constrains action?

A passive list of available contexts is not enough. AI systems need context routing by user, task, intent and risk.
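Here is a rough sketch of what routing could look like, just to show the shape. The source names, task labels and the "withhold destructive tools when risk is high" rule are all my own assumptions for illustration, not a description of any real harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextSource:
    name: str
    tasks: frozenset[str]      # task types this source is useful for
    destructive: bool = False  # True if using it can change or delete data


# Hypothetical catalogue of everything the agent *could* be given.
CATALOGUE = [
    ContextSource("repo_search", frozenset({"coding"})),
    ContextSource("recent_changes", frozenset({"coding"})),
    ContextSource("entitlements_api", frozenset({"support"})),
    ContextSource("known_issues", frozenset({"support", "coding"})),
    ContextSource("policy_docs", frozenset({"compliance", "support"})),
    ContextSource("delete_record", frozenset({"support"}), destructive=True),
]


def route(task: str, risk: str) -> list[str]:
    """Return only the sources relevant to the task.

    Destructive tools are withheld when the caller marks the risk as "high".
    """
    selected = []
    for source in CATALOGUE:
        if task not in source.tasks:
            continue
        if source.destructive and risk == "high":
            continue
        selected.append(source.name)
    return selected


print(route("coding", "low"))
print(route("support", "high"))
```

The point is that the model never sees the full catalogue. A real router would also take the user and their intent into account, which this sketch leaves out.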

Every book its reader

The book version: a book that exists but cannot be found has effectively ceased to exist. Cataloguing is not administration. It is how the book finds the reader who needs it.

The context version: useful information inside your organisation is worthless to the model if the model cannot discover it at the point of need.

The runbook is in a wiki. The policy exception is in a Slack thread. The customer-specific behaviour is in a support ticket. The design rationale is in an ADR. The safe way to perform an operation is buried in a tool description written for platform engineers.

The information exists, but the model cannot find it.

This is mostly a vocabulary problem. The user asks about “cancelling an account”. The system knows about “subscription termination”, “offboarding”, “churn workflow” and “account closure”. They are the same thing. The model sees four separate shelves and picks the closest one, which may not be the right one.

In a human organisation, this gap is bridged by the person who has been around long enough to know that “the new auth flow” in Slack, “JWT validation layer” in the codebase, and “identity boundary” in the design doc are all the same thing. That person is not on call when your agent is running at 2am (or they are and they are exceedingly grumpy).

The catalogue needs to do that work instead. Synonyms, ownership, source-of-truth markers, provenance and most importantly the connections between them.
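A minimal sketch of such a catalogue entry, using the account-cancellation vocabulary from above. The owner and source-of-truth values are made up, and exact-match lookup is the simplest thing that could work (a real system would likely need fuzzier matching).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CatalogueEntry:
    canonical: str
    synonyms: set[str]
    owner: str
    source_of_truth: str


CATALOGUE = [
    CatalogueEntry(
        canonical="account closure",
        synonyms={
            "cancelling an account",
            "subscription termination",
            "offboarding",
            "churn workflow",
        },
        owner="billing-team",                             # hypothetical owner
        source_of_truth="wiki/runbooks/account-closure",  # hypothetical location
    ),
]


def normalise(term: str) -> str:
    """Lower-case the term and collapse repeated whitespace."""
    return " ".join(term.lower().split())


def lookup(term: str) -> Optional[CatalogueEntry]:
    """Map any known synonym onto its single catalogue entry."""
    wanted = normalise(term)
    for entry in CATALOGUE:
        if wanted == entry.canonical or wanted in entry.synonyms:
            return entry
    return None


entry = lookup("Cancelling an  Account")
print(entry.canonical if entry else "not catalogued")
```

All four shelves now resolve to one entry with one owner and one source of truth, which is the work the long-serving colleague was doing in their head.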

Save the time of the reader

The book version: the reader’s time is precious. A library fails if the right book exists but takes an hour to find.

The context version: save the model’s tokens.

If you work with Claude Code without a tight AGENTS.md it’s easy to see this in action. Any query will spawn tool calls to find that information, resulting in huge numbers of tokens being consumed.

Tool descriptions show the same pattern. A delete_record tool ships with the internal API doc (parameter types, error codes, retry semantics, the lot!). But what the model actually needs to know (“this is destructive and irreversible; prefer archive_record unless the user explicitly asks for deletion”) is nowhere in the description.
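For illustration, here is what a leaner description might look like. The name/description/parameters shape is the common one, but this is a hypothetical definition that isn't tied to any particular SDK, and the check at the bottom is a deliberately crude assumption about what "says the important thing" means.

```python
# Hypothetical tool definition: the description leads with what the model
# needs in order to decide, not with the internal API documentation.
delete_record_tool = {
    "name": "delete_record",
    "description": (
        "Permanently deletes a record. This is destructive and irreversible. "
        "Prefer archive_record unless the user explicitly asks for deletion."
    ),
    "parameters": {
        "type": "object",
        "properties": {"record_id": {"type": "string"}},
        "required": ["record_id"],
    },
}

# The kind of description the paragraph above complains about.
bloated_description = (
    "DELETE /v2/records/{id}. Returns 204, 404 or 409. "
    "Retries use exponential backoff with jitter."
)


def description_flags_risk(description: str) -> bool:
    """Crude check: does the text warn about irreversibility and name the safer tool?"""
    text = description.lower()
    return "irreversible" in text and "archive_record" in text


print(description_flags_risk(delete_record_tool["description"]))
print(description_flags_risk(bloated_description))
```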

And it’s not just about cost. Every unnecessary token consumes context window. Every bloated description makes the model do secondary work before the real work.

The model may technically have the answer and still fail because the answer is buried under sludge. Correct but bloated context is still bad context.

If the tool, RAG system or memory system is returning more than it needs then it risks creating distraction. The test for this is simple: remove the context. If the model doesn’t give worse answers, it was waste.
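That test is easy to sketch. The `answer` and `grade` functions below are stand-ins I've invented so the example runs on its own. In practice `answer` would call your model and `grade` would be your eval.

```python
from typing import Callable


def ablation_report(
    answer: Callable[[str, list[str]], str],
    grade: Callable[[str], float],
    question: str,
    context: list[str],
) -> dict[str, float]:
    """Score the answer with full context, then with each item removed in turn.

    Returns the score drop caused by removing each item. A drop of zero (or
    less) suggests the item was not pulling its weight for this question.
    """
    baseline = grade(answer(question, context))
    drops = {}
    for i, item in enumerate(context):
        without = context[:i] + context[i + 1:]
        drops[item] = baseline - grade(answer(question, without))
    return drops


# Stand-in model: it only "knows" the refund window if that chunk is present.
def fake_answer(question: str, context: list[str]) -> str:
    for chunk in context:
        if "30 days" in chunk:
            return "The refund window is 30 days."
    return "I don't know."


def fake_grade(response: str) -> float:
    return 1.0 if "30 days" in response else 0.0


context = [
    "Refunds are allowed within 30 days.",
    "Our office dog is called Biscuit.",
]
report = ablation_report(fake_answer, fake_grade, "What is the refund window?", context)
print(report)
```

One question is not enough to condemn a chunk, so you would want to run this across a set of representative questions before pruning anything.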

The library is a growing organism

The book version: a library is never finished. Its collection grows. Its readers change. Its classification scheme must adapt to subjects the original cataloguer could not predict.

The context version: your AI system is never finished either.

The prompts you write today encode assumptions about how your tools work, what your data looks like, and what your users need. Those assumptions will become false. APIs change. Products pivot. New edge cases emerge that nobody anticipated. A retrieval strategy tuned for fifty documents quietly degrades when the corpus reaches five thousand.

The harder problem is that the decay is invisible. The model doesn’t throw an error when it acts on a stale policy. It just gives a confidently wrong answer.

In a real library, the cataloguer has a job. They weed the collection, update classifications, retire obsolete material, and decide what the library needs to acquire next. AI context systems rarely have an equivalent. The system prompt was written once. The tool descriptions shipped with the feature and were never revisited, and the memory layer accumulates without pruning.

This is why context systems need ownership and instrumentation. Someone (or something) has to notice when the vocabulary has drifted or when a tool description no longer matches the tool. That is not a one-time task. It is the ongoing work of tending the library.
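The cheapest form of instrumentation I can think of is a review date on every piece of context. A sketch, where the item names, owners, dates and the 90-day limit are all invented for illustration:

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ContextItem:
    name: str
    owner: str
    last_reviewed: date


def stale_items(items: list[ContextItem], today: date, max_age_days: int = 90) -> list[ContextItem]:
    """Return the items whose last review is older than the allowed age."""
    cutoff = today - timedelta(days=max_age_days)
    return [item for item in items if item.last_reviewed < cutoff]


# Hypothetical inventory of the context an agent depends on.
items = [
    ContextItem("system prompt", "platform-team", date(2025, 1, 10)),
    ContextItem("delete_record description", "records-team", date(2026, 4, 1)),
    ContextItem("memory layer", "nobody", date(2025, 11, 3)),
]

overdue = stale_items(items, today=date(2026, 5, 18))
for item in overdue:
    print(f"{item.name} (owner: {item.owner}) last reviewed {item.last_reviewed}")
```

A date does not tell you the content is wrong, only that nobody has looked. That is still more than most context stacks know about themselves.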

The point (if there is one)

Context systems are hard. System performance is a function of model, prompt and library.

The library is currently a hodgepodge of individual technologies (RAG, tools, MCP, memory, conversation history, prompts, skills, connectors and I’m sure many more) with no coherent organization.

These can all be improved individually, but they are just individual books of information in a larger system. The real opportunity is to design the library.[1]

That means cataloguing, routing, vocabulary control, provenance, freshness, pruning, circulation policy and ownership of the whole experience by which an AI system finds what it needs to act.

When an AI system retrieves the wrong document, invokes the wrong tool, ignores the policy, uses stale information or fills the context window with noise, that is not (always) a model failure, it’s often a library failure.

[1] The counterargument to this would be “hey, models are SO smart now that they can find the needle in the haystack”. Even if that becomes true, curating the library still feels important for the other aspects of performance (cost, time to answer).

Don't outsource the struggle

Jeff — Mon, 11 May 2026 06:06:12 GMT

The worst way to use AI is to ask it to do the work.

The problem isn’t that the output will be bad. Increasingly it’s pretty good, and that’s the problem.

If the output was obviously broken, then you’d notice, fix it and learn something along the way. But if it’s good enough, then you haven’t learned a thing.

Obligatory image because posts look odd without one…

Desirable Difficulties

In the 1990s, Robert Bjork ran a series of studies that upended how psychologists thought about practice. He found that the conditions that feel most productive (easy repetition, fluent recall) often produce the weakest retention. Difficulty was not the enemy of learning. Certain kinds of difficulty were the mechanism, and he called them desirable difficulties.

Difficulty that forces retrieval or that makes you generate an answer before seeing it is desirable difficulty and builds durable knowledge. Difficulty that comes from unclear instructions or a wasted search? That’s just a waste of time.

Which kind of difficulty does AI remove?

Friction

I’ve written before about friction. There are different types of friction, so which of them are desirable difficulties?

Mechanical friction

This is the boring, tedious stuff. Remembering the syntax for initializing a multi-dimensional array that you rarely use. Remembering the API incantation. The glue code that needs to be written. None of these things really teach you anything new.[1]

This isn’t a desirable difficulty. Let the LLM do the boring work.

Navigational friction

Finding your way around a codebase can be a pain in the ass. A modern IDE will allow you to navigate to the interface easily, but then you’ll hop through implementations until you find the right one, then step back and go into the dependency injection mush to ensure you’ve found the right file and eventually you’ll find that one example to know how the pattern works.

This one is a desirable difficulty. Once you understand how the system is wired you’ve got fantastic context for the codebase.

Use the LLM as a guide. Don’t just let the LLM go and make the edits, ask it to give you context. For example, “help me trace where this behaviour lives. Show me the files involved and explain the flow before we change the code”.

Comprehension friction

Understanding what the code does is the core job of software engineering.

This is where learning starts. If you skip it, you get familiarity with the output without understanding of the system. That is the engineering equivalent of being able to recite a proof without knowing why any of the steps work.

Failing to do this will turn you into a cargo-cult engineer.

Don’t outsource the comprehension struggle.

Use the LLM to validate this understanding. For example, give the LLM your own understanding and have it critiqued. Or even ask the LLM to quiz you on a particular area (“Test my knowledge of the Foo subsystem by asking me questions”).

Design friction

Choosing an approach. Rejecting the alternatives. Seeing trade-offs. Noticing what each option makes harder three months from now.

Amplify this. This is the biggest learning opportunity. And if you skip it, well, see Nineteen Features, Zero Architecture!

You need trade-offs here. Don’t ask the LLM to just produce a design. Instead ask for multiple designs and work through the trade-offs yourself. For each, understand what it makes easier or harder and most importantly make a choice yourself!

Verification friction

Code has been written, but does it work? There are two questions here. Does the code do what it’s supposed to do? And does what it’s supposed to do match what stakeholders actually need?

An LLM can help with the first. It can stress-test your assumptions and prove behaviour with unit tests. The trap: if the LLM writes the code and writes the tests, you’ve automated both sides of the verification loop. Bainbridge flagged this already: humans who only supervise automation lose the very skills they need when the automation eventually fails.

An LLM can’t help with the second. You need a mental model not just of the architecture, but of the why. Software is about solving problems for people. Don’t outsource that.

Articulation friction

Writing the PR description. Explaining your reasoning. Naming what you are not sure about.

This is how private thinking becomes visible. The act of articulation is part of the work. Offloading it entirely means you never find out whether your understanding is as solid as you assumed.

Use the LLM to sharpen your thinking, not to do it for you.

Accountability friction

If you’re using an LLM you must own the result and be able to defend it. AI must never be used to remove this friction.

If you are responding to PR comments automatically with AI, then you need to own what it says.

Accountability is the ultimate desirable difficulty!

The implementation-monkey failure mode

There is a version of AI-assisted engineering that feels like productivity but produces no learning.

The engineer receives a ticket. They paste it into a chat window. They accept the diff. They patch the errors the linter finds. They ship the code. The ticket closes. Win!?

However, in this version the engineer is just a passenger for the journey. They haven’t built any additional capability and won’t grow as an engineer.[2]

Driver vs. Passenger

I believe there are two different approaches to using AI.

In the passenger mode, you give the AI tasks, and it gives you artifacts. In this model, the AI is doing all the work, and you’re just hitching along.

In the driver mode, you use AI to understand the terrain, generate options and test your assumptions. In driver mode, you are controlling the shape of the work, and you’re using AI to compress the parts that do not build judgement and amplifying the bits that do.

Be a driver!

What this means for organisations

It’s a system! Banning AI or policing prompts won’t produce the right results. You’ve got to make the journey mode visible and make the passenger mode uncomfortable.

What might that look like?

Plans before implementation - ask an engineer to explain what they are building before they build it.
PR checklists should include options - make sure that significant pieces of work have the trade-offs behind choices laid out and explained.
What did you learn - PR reviews don’t have to be solely about correctness; the mentorship aspect is also important (Thoughtworks described reviews as the bundling of mentorship, consistency, correctness and trust).
Prompt examples in team documentation - Show the difference between prompts that directly produce artifacts versus those that produce understanding.

Conclusion

The goal is not to make work frictionless.

The goal is to remove the friction that wastes attention and preserve the friction that builds judgement. Bjork’s desirable difficulties are still desirable. The undesirable ones are still worth eliminating. The job is to know the difference.

Use AI, but be a driver, not a passenger.

[1] I used to think it was useful to know all this kind of syntax, but since it’s now easily lookup-able, I don’t think there’s much value in it.

[2] I suppose they will be super good at LLM usage though!

Bad code is contagious

Jeff — Mon, 04 May 2026 06:54:53 GMT

Bad code has always spread through codebases by imitation. AI just turned a slow cultural process into an epidemic.

When a new developer joins a project, they learn what code should look like by copying what’s already there. Style guides are usually out of date or aspirational. ADRs capture decisions but not necessarily reality. The only reliable signal is the code itself, and “when in Rome” is a sensible default.

The problem is that the default propagates whatever is already there, good or bad. Point an imitation machine at a corpus of imperfect code and you get propagation and compounding. Just like a virus.

Why code conventions spread like a virus

In small teams, conventions spread deliberately. Someone makes a call, writes it down[1], mentions it in review, and the team mostly follows.

At scale, that breaks down. Most developers learn conventions less from documentation than from neighbouring code. The local environment becomes the fitness landscape, and “when in Rome” becomes the replication mechanism.

AI agents just accelerate this process.

A short primer on SIR

The SIR model divides a population into three states: Susceptible, Infected, Recovered. Within the model there are three important numbers.

Transmission probability - when an infected individual contacts a susceptible one, this is the chance the pathogen passes across
Contact rate - how often infected individuals contact susceptibles in the first place.
Recovery rate - how quickly infected individuals stop being infected.

Multiply the transmission probability by the contact rate, divide by the recovery rate, and you get R0, the basic reproduction number. It is the number of new infections caused, on average, by a single infected individual in a fully susceptible population. As you will remember from the COVID era, if R0 is below one[2], the outbreak dies out. If R0 is above one, it grows.
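As a worked example, the usual textbook form is R0 = transmission probability × contact rate ÷ recovery rate (dividing by the recovery rate is the same as multiplying by how long an individual stays infectious). The numbers below are invented purely to show the arithmetic.

```python
def basic_reproduction_number(
    transmission_probability: float,
    contact_rate: float,
    recovery_rate: float,
) -> float:
    """R0 = (transmission probability x contacts per day) / recovery rate per day."""
    if recovery_rate <= 0:
        raise ValueError("recovery_rate must be positive")
    return transmission_probability * contact_rate / recovery_rate


# Made-up numbers: a 25% chance per contact, 4 contacts a day,
# and a 20% chance of recovering on any given day (about 5 days infectious).
r0 = basic_reproduction_number(0.25, 4, 0.2)
print(round(r0, 2))

# Doubling the recovery rate halves R0.
print(round(basic_reproduction_number(0.25, 4, 0.4), 2))
```

Note how the recovery rate sits in the denominator: recovering faster is just as effective as transmitting less.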

So, let’s stretch this model across to software.

Transmission probability is now how convincing a bad pattern looks when an agent reads it. A well-named, well-structured and well-commented bad pattern transmits at a much higher rate than a scruffy one. This is part of why AI-generated code is so dangerous: it looks just as plausible as human-written code, if not more so.
Contact rate is how aggressively agent sessions sample neighbouring files. Big context windows, tools that pull in all relevant files, and prompts that say things like “follow the existing patterns in this repo” all increase the contact rate.
Recovery rate is how quickly you refactor, review or remove the pattern. This is entirely a function of your engineering practices. You get to control this one!

The R0 threshold

For the first example, let’s pretend we have the architecture below. There are roughly 120 components. There are some foundational services in the middle that almost everything else uses, but most of the classes only use a handful of other classes.

Architecture? Nah, it’s a randomly generated network https://en.wikipedia.org/wiki/Barabási–Albert_model

We will imagine the virus to be AI-generated code and simulate three different styles of usage.

Simulating pattern spread across architecture

Conservative - In this example, think mostly using AI within files with tight prompts and narrow context. In this usage of AI, code patterns spread slowly enough that recovery can keep pace. R0 in this case is around 1.
Default - Now we are imagining a more typical AI harness with no special constraints. AI pulls in patterns as needed through the harness and uses much of the context window. R0 is about five in this case. The pattern does not burn through every component, but it is sticky enough that you’ve now got roughly a quarter of your components carrying a wrong pattern, and it is hard to remove since it has prior art on its side.
Aggressive - wide context windows, tools that eagerly sample neighbours, prompts that tell the agent to follow existing patterns can push R0 up to fifteen. This is bad. By day thirteen, the contaminated pattern is the dominant one.

The important thing about R0 is that it behaves like a threshold, not a linear dial. Below one, the pattern struggles to sustain itself. Above one, it starts finding chains of transmission. The difference between 0.9 and 1.1 is not a rounding error. It is the difference between local noise and systemic spread.
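If you want to play with this yourself, here is a minimal sketch of this kind of simulation: a preferential-attachment network (in the style of Barabási–Albert) with a day-by-day SIR process on top. It is not the code behind the charts, and the per-contact infection chance (`beta`), the daily recovery chance (`gamma`) and the three settings at the bottom are my own assumptions.

```python
import random


def barabasi_albert(n: int, m: int, rng: random.Random) -> dict[int, set[int]]:
    """Grow a graph by preferential attachment: each new node links to m
    existing nodes, chosen with probability proportional to their degree."""
    neighbours = {i: set() for i in range(n)}
    # Start from a small fully connected core of m + 1 nodes.
    for a in range(m + 1):
        for b in range(a + 1, m + 1):
            neighbours[a].add(b)
            neighbours[b].add(a)
    # Each node appears here once per edge end, so a uniform draw from the
    # list is a degree-weighted draw.
    ends = [node for node, adj in neighbours.items() for _ in adj]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(ends))
        for target in targets:
            neighbours[new].add(target)
            neighbours[target].add(new)
            ends.extend([new, target])
    return neighbours


def simulate(neighbours, beta, gamma, days, rng, patient_zero=0):
    """Discrete-time SIR. Returns one (S, I, R) tuple of counts per day."""
    state = {node: "S" for node in neighbours}
    state[patient_zero] = "I"
    history = []
    for _ in range(days):
        infected = [node for node, s in state.items() if s == "I"]
        newly_infected, recovered = set(), set()
        for node in infected:
            # Each infected component gets a chance to pass the pattern on
            # to every susceptible neighbour.
            for other in neighbours[node]:
                if state[other] == "S" and rng.random() < beta:
                    newly_infected.add(other)
            # ...and a chance of being refactored back to health.
            if rng.random() < gamma:
                recovered.add(node)
        for node in newly_infected:
            state[node] = "I"
        for node in recovered:
            state[node] = "R"
        history.append(tuple(
            sum(1 for s in state.values() if s == label) for label in "SIR"
        ))
    return history


graph = barabasi_albert(120, 2, random.Random(42))
for label, beta in [("conservative", 0.02), ("default", 0.1), ("aggressive", 0.3)]:
    history = simulate(graph, beta=beta, gamma=0.1, days=60, rng=random.Random(1))
    print(label, "peak infected:", max(i for _, i, _ in history))
```

The exact numbers depend on the seed and the parameters, so treat any single run as an anecdote and look at the shape across many runs.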

Vaccinating your codebase

With the industry drive being to AI all the things, R0 is mostly out of your hands. You might be able to constrain agent configuration at the margin, but you can’t change the overall trajectory. So the question becomes: what can you do about the population itself?

The epidemiological answer is vaccination. You do not need to protect everyone, but you need to protect enough of the population that the average infected individual passes the infection on to fewer than one other. If you can achieve that, then the effective R drops below one and the outbreak dies even though R0 in the unprotected subpopulation is still high. This is herd immunity, and it is one of the most counter-intuitive and important results in the field.
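For a well-mixed population the classic result is that the effective R is R0 × (1 − immune fraction), so the threshold is 1 − 1/R0. A quick sketch of that arithmetic follows. Bear in mind that it assumes everyone mixes evenly, which a codebase with a few heavily used hubs does not, so a network simulation can give different numbers.

```python
def effective_r(r0: float, immune_fraction: float) -> float:
    """Average new infections per case once a fraction of contacts is immune."""
    return r0 * (1 - immune_fraction)


def herd_immunity_threshold(r0: float) -> float:
    """Immune fraction at which effective R falls to one (well-mixed assumption)."""
    if r0 <= 1:
        return 0.0  # the outbreak dies out without any help
    return 1 - 1 / r0


for r0 in (1.0, 5.0, 15.0):
    print(r0, round(herd_immunity_threshold(r0), 3))
```

The higher R0 climbs, the closer to everyone you have to protect, which is why keeping the contact rate down still matters even when you are investing in vaccination.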

The software analogue to a vaccine is architectural. A vaccinated component is one where the bad pattern cannot take hold. A vaccine could be a lint-rule, an automated test, a schema contract or even a golden-path scaffold that wins because the agent reads it first. Vaccinated services are immune not because they are better written, but because the environment around them actively refuses the contaminated pattern.

In the next chart, we will use the Default scenario and now vary which fraction of services have vaccination applied.

Simulating what happens with different levels of vaccination.

You might be thinking, surely if I write the correct rule, I can vaccinate 100%? Well, if that were true then you’d be able to write an oracle to say if something was good or bad and I think that’s too hard for me! I prefer to look at it from the other direction. If I can write a fuzzy rule that rejects some percentage of bad code, then the pattern stops spreading.

At zero coverage the outbreak peaks around thirty services. At thirty percent coverage the peak collapses by three-quarters to about seven. At sixty percent it is essentially extinguished - you get small, transient flare-ups that recover on their own. At eighty-five percent the pattern cannot get going at all.

This matters for engineering decisions because the intuitive model says, “we will roll out linting progressively, starting with the most critical services”. The epidemiological model says something different: the value of linting is strongly non-linear in coverage, and you do not get real protection until you are above the herd-immunity threshold. A thirty percent rollout does not deliver thirty percent of the benefit. In this simulation, it removes roughly seventy-five percent of the peak outbreak, because it breaks transmission chains.

Vaccinate your codebase!

If this analogy holds, then the interventions we should make to stop bad patterns spreading are:

Golden paths are vaccination - Every component that uses a golden path is a vaccinated member of the population. Golden paths will get copied by the AI. Herd immunity appears when enough of the codebase uses them.
Linting rules / Architectural fitness functions are barriers at the point of contact (masks, hand-washing etc.) - They do not prevent a bad pattern being generated, but instead they catch it before it reaches a service. A good lint rule is one that blocks the wrong thing from being committed. It also does not have to be 100% effective.
Agent scope is the contact rate - An agent session that touches many files is a house-party! Smaller, more well-defined tasks aren’t just about code quality, they reduce R0 directly.
Refactoring is treatment - This is the recovery rate. It is the one parameter you can raise directly and it’s the one most often neglected. “We will refactor when we have time” is the epidemiological equivalent of “we will treat the sick when we have time”. You can raise recovery rate by funding platform teams, protecting time for it or by making it cheap enough to happen inside ordinary work.
Observability is surveillance - You can’t respond to an outbreak you cannot see. Tools that scan for pattern proliferation (duplicated logic, drift from golden paths) are the codebase equivalent of contact tracing.

Engineering leaders spend a lot of time thinking about how to produce more code. The more useful question is how to ensure that the code you produce does not contaminate the code you already had. Epidemiology has been thinking about that problem for two hundred years. The intervention that matters is raising the recovery rate and vaccinating enough of the population that bad patterns cannot find a chain to run through.

No-one really writes it down, but we all like to pretend that we do.

During COVID almost every night there would be a reporter trying to explain R0!

Estimating AI's carbon footprint.

Jeff — Mon, 27 Apr 2026 05:54:47 GMT

The environmental impact of AI is huge. There’s an insatiable quest for power (I mean electricity, but you know, other types of power too). On the flipside, it’s clear that there are some use-cases (particularly software development) where there are benefits.

How do you quantify the impact of your AI usage on your carbon footprint?

Neither Anthropic nor OpenAI publishes concrete information about their environmental impact. In this post, I’ll try to estimate some bounds.

First off, a disclaimer. I’m an armchair researcher writing things up based on publicly available information. Don’t take this as gospel; it’s an estimate and if I’ve made incorrect assumptions, tell me!

Tommaso.sansone91, CC0, via Wikimedia Commons

Everyone has a number

Sam Altman says an average ChatGPT query uses about 0.34 watt-hours. Google’s August 2025 paper on Gemini puts the median text prompt at 0.24 Wh.

The trouble is that any single number describes a population. A short factual question on a non-reasoning model, served from a hyper-scaler on a clean grid, sits at one end. An ARC-AGI puzzle solved by a frontier model in high-compute mode sits at the other.

Once you fix the model and infrastructure, the dominant variable you actually control is the number of tokens the system burns.

The formula.

There is a simple way to think about the cost of an AI query:

How much computation did the model do, how much electricity did that require, and how dirty was the electricity?

For a transformer, generating one token takes roughly two floating-point operations (one multiply and one add) per active parameter.

So, the first-order compute cost is:

FLOPs ≈ 2 × N_active × T_effective

Where:

N_active is the number of active parameters used per token.
T_effective is the number of tokens the system actually burned.

T_effective is the part doing a lot of work. It’s not just the number of tokens you see on the screen. It includes the prompt, the answer, the hidden reasoning tokens, any agent loops, retries, tool calls and so on. A normal chat request might be a few hundred or a few thousand tokens, but a hard reasoning task could be many multiples of that.
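As a rough sketch of that bookkeeping, the snippet below adds up the token categories listed above and turns them into a first-order FLOP count. The `TokenUsage` class and every token count in it are hypothetical, chosen only to show how quickly hidden tokens dominate.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:
    """Token categories named in the text; the field names are assumptions."""
    prompt: int = 0
    answer: int = 0
    hidden_reasoning: int = 0
    agent_and_tool_calls: int = 0
    retries: int = 0

    @property
    def effective(self) -> int:
        # T_effective: everything the system burned, not just what you see
        return (self.prompt + self.answer + self.hidden_reasoning
                + self.agent_and_tool_calls + self.retries)


def flops(n_active: float, usage: TokenUsage) -> float:
    """First-order compute: two operations per active parameter per token."""
    return 2 * n_active * usage.effective


# Illustrative numbers only
chat = TokenUsage(prompt=300, answer=500)
reasoning = TokenUsage(prompt=300, answer=500, hidden_reasoning=8000, retries=1200)

print(chat.effective, flops(100e9, chat))
print(reasoning.effective, flops(100e9, reasoning))
```

The visible part of both requests is identical; the second one simply burns more than ten times the tokens behind the scenes.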

That’s why “what does a query cost?” is the wrong question. A query isn’t a unit. Tokens are a bit closer, but even then, that hides the batching, hardware and datacentre effects.

To turn compute into energy, we need to account for how efficiently the GPU is actually being used. This looks more complicated than it is!

E = [FLOPs / (P_peak × e_serve)] × P_GPU × α × η_node × PUE

The fraction gives us seconds of GPU time (total FLOPs needed divided by the realised speed of the GPU). The remaining terms turn seconds into joules.

In plain English:

P_peak is the theoretical speed of the GPU.
e_serve is the realised serving efficiency (how much of that peak throughput is actually used).
P_GPU is the GPU’s power draw.
α is the fraction of the GPU’s rated power actually used.
PUE is the datacentre overhead. If the servers use 1 kWh and the building uses another 0.2 kWh for cooling and power conversion, the PUE is 1.2.
η_node adds the rest of the server: CPU, memory, networking, power losses.

And now to get an estimate of the CO2 equivalent.

gCO₂e = (E / 3,600,000) × CI

The 3,600,000 is just the conversion from joules to kilowatt-hours. CI is the carbon intensity of the electricity grid, measured in grams of CO₂ per kWh.

What are some plausible numbers?

Let’s see if we can get anywhere near the numbers from Google and Sam Altman.

Let’s start making assumptions. I’ll use an H100-ish GPU as the reference point because that was probably dominant at the time of those comments, and it’s a good baseline.

Active model size: 100B parameters
Accelerator peak throughput: 3.96 PFLOP/s
Realised serving efficiency: 5%
GPU rated power: 700W
Actual power draw: 60% of rated power
Node overhead: 1.3×
Datacentre PUE: 1.1

5% efficiency sounds absolutely mad, right? There’s more to LLM serving than raw arithmetic: prefill, decoding, KV cache, memory bandwidth, batching and so on. NVIDIA published a benchmark for Llama 2 70B on 8 H100 GPUs. The server processed about 5 inferences per second, and that works out to about 5% of the peak throughput.1

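Plugging all of those numbers into our formula gives the energy per token:

E ≈ [(2 × 100×10⁹) / (3.96×10¹⁵ × 0.05)] × 700 × 0.6 × 1.3 × 1.1 ≈ 0.61 J

And if we simplify this (and divide by 3.6M to get kilowatt-hours), we get:

0.61 J ÷ 3,600,000 ≈ 1.7 × 10⁻⁷ kWh per token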

So, under the assumptions I’ve made, every 1,000 tokens costs about 0.17 Wh, which feels close enough to the numbers cited by Altman and Google. If I work backwards from their output, it’s 2,000 effective tokens for ChatGPT and 1,400 for Google, and that sounds like a plausible number of tokens to me!2
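The same arithmetic as a small sketch, so the assumptions can be swapped out. The function and parameter names are mine; the default values are the ones listed above, and the last two lines work backwards from the 0.34 Wh and 0.24 Wh figures.

```python
def energy_joules(tokens, n_active=100e9, p_peak=3.96e15, e_serve=0.05,
                  p_gpu=700.0, alpha=0.6, node_overhead=1.3, pue=1.1):
    """Energy for a number of effective tokens, using the post's assumptions."""
    flops = 2 * n_active * tokens                # first-order compute cost
    gpu_seconds = flops / (p_peak * e_serve)     # time at realised throughput
    return gpu_seconds * p_gpu * alpha * node_overhead * pue


def watt_hours(tokens, **assumptions):
    """Convert joules to watt-hours (1 Wh = 3,600 J)."""
    return energy_joules(tokens, **assumptions) / 3600.0


wh_per_1000 = watt_hours(1000)
tokens_for_altman = 0.34 / wh_per_1000 * 1000    # ChatGPT figure, in tokens
tokens_for_google = 0.24 / wh_per_1000 * 1000    # Gemini figure, in tokens

print(f"{wh_per_1000:.3f} Wh per 1,000 tokens")
print(f"{tokens_for_altman:.0f} tokens, {tokens_for_google:.0f} tokens")
```

Changing `e_serve` or `n_active` moves the answer proportionally, which is a useful reminder of how much the result leans on those two guesses.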

Translating into carbon footprint.

Once we have the watt-hours, we multiply them by the carbon intensity of the grid.

If you are using gas turbines to power your LLM then that’s one end of the spectrum (somewhere between 500-600 grams of CO₂ per kilowatt-hour, see here). In the UK, the carbon intensity was 125 g CO₂/kWh (see 2024 energy review). And in Norway, it’s super clean at 2.2 g CO₂/kWh.

If we plug that into our formula we get:

I’ll use gCO₂e rather than just gCO₂. The “e” means equivalent: other greenhouse gases, especially methane, converted into the warming impact of CO₂.

So, a 2,000 effective-token query is:

at 550 g/kWh: about 0.19 gCO₂e
at 125 g/kWh: about 0.04 gCO₂e
at 30 g/kWh: about 0.01 gCO₂e
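A short sketch of that conversion, assuming the 0.17 Wh per 1,000 tokens estimate from earlier. The function name is hypothetical, and the three carbon intensities are the ones used in the list above.

```python
WH_PER_1000_TOKENS = 0.17  # the estimate derived earlier in this post


def grams_co2e(tokens, carbon_intensity_g_per_kwh, wh_per_1000=WH_PER_1000_TOKENS):
    """Grams of CO2e for a number of effective tokens on a given grid."""
    kwh = tokens / 1000 * wh_per_1000 / 1000   # Wh -> kWh
    return kwh * carbon_intensity_g_per_kwh


for intensity in (550, 125, 30):
    print(f"{intensity} g/kWh: {grams_co2e(2000, intensity):.2f} gCO2e")
```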

These are tiny amounts for sure, but what happens at scale?

Tokenmaxxing

Meta used 60.2 trillion AI tokens in 30 days (via The Pragmatic Engineer).

This is a staggeringly large impact3 - that’s 700 return economy trips between London and NY. Given that there are typically 20 or so flights per day between London and NY, this is clearly a pretty significant number. And that’s just Meta!
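For a sense of scale, the sketch below applies the same 0.17 Wh per 1,000 tokens estimate to the 60.2 trillion token figure. It assumes every one of those tokens costs what my reference model costs, which is a big assumption, and it stops at energy and tonnes of CO₂e for the three grid intensities used earlier. It does not try to reproduce the flight comparison, because that depends on a per-flight figure not given here.

```python
TOKENS = 60.2e12            # tokens in 30 days, from the figure quoted above
WH_PER_1000_TOKENS = 0.17   # this post's estimate; likely an over-estimate

energy_wh = TOKENS / 1000 * WH_PER_1000_TOKENS
energy_gwh = energy_wh / 1e9


def tonnes_co2e(gwh, carbon_intensity_g_per_kwh):
    """1 GWh = 1e6 kWh and 1 tonne = 1e6 g, so the factors cancel."""
    return gwh * carbon_intensity_g_per_kwh


print(f"{energy_gwh:.1f} GWh")
for intensity in (550, 125, 30):
    print(f"{intensity} g/kWh: {tonnes_co2e(energy_gwh, intensity):.0f} tonnes CO2e")
```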

What does this mean?

The point is not that every AI query is an environmental disaster. Most aren’t. A single short prompt is tiny compared with flying, heating a house, or driving a car.

The problem is abstraction. Once tokens are hidden behind chat boxes, agents, IDE plugins and “autonomous” workflows, nobody notices when a simple question becomes a thousand-step computation.

That makes AI carbon a design problem.

Engineering leaders should stop asking, “Is AI good or bad for the environment?” and start asking better questions. Which model is appropriate for this task? How much context are we sending? Do we need reasoning mode? Are retries bounded? Are agents allowed to loop? Are we measuring tokens per useful outcome?

The useful distinction is not AI versus no AI. It is valuable tokens versus waste tokens.

OK, this feels like the weakest assumption I’ve made, and I’ve probably misunderstood something, but here’s my logic.

The workload was 2048 input tokens, and an output of 128. Llama 2 70B is a dense model, so the estimate of FLOPs per inference in this case is 2 x 2176 x 70B.

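If we do five inferences per second, then we get 5 x 2 x 2176 x 70B ≈ 1.5 x 10^15 FLOP/second. If we take the peak throughput of 3.96 PFLOP/s and multiply by 8 GPUs, we get a total peak of about 3.2 x 10^16 FLOP/second.

Take the ratio of the two and we get:

(5 x 2 x 2176 x 70B) / (8 x 3.96 x 10^15) ≈ 0.048, or roughly 5%.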

I even checked with Wolfram Alpha.

If I run the numbers with the B200 I get 0.11 Wh per 1,000 tokens vs. 0.17 Wh (roughly 50% more tokens per watt-hour!). I would guess the TPUs and subsequent generations will be much more efficient.

Under the assumptions I have made of course - it’s very likely a vast over-estimate because I’d hope things like caching kick in and save a whole bunch of work.

When building gets cheaper than thinking

Jeff — Mon, 20 Apr 2026 05:22:41 GMT

The double diamond encoded an economic assumption that no longer holds.

For twenty years it’s been the default shape of product thinking: taught in every bootcamp, drawn on every whiteboard. It fit a world where building was expensive and understanding was cheap-ish. You invested weeks in research because shipping the wrong thing cost months, and you’d rather spend a fortnight with a whiteboard than a quarter with an expensive engineering team. The diamond shape itself encodes that trade-off: convergence is cheaper than rework.

That trade-off is gone.

Implicit assumptions

Research is sampled. The old move was to interview a dozen customers, cover a wall in sticky notes, and squint until themes emerged. The implicit model was that understanding users is a bottleneck you pay down with weeks of fieldwork. Now every customer interaction across the business (support tickets, sales calls, session recordings, forum posts, Slack messages from the CS team) can be transcribed and synthesised in an afternoon. The hard part is no longer finding users (unless you are a startup trying to find those first few customers); it is finding the insights.

Problems must be defined before they’re solved. This made sense when engineering was the expensive phase. You defined carefully because building was where the money went. When the marginal cost of a working prototype approaches zero, “define the problem precisely” and “build three versions and see which one users actually grab” become competing strategies, and the second often wins. Call it discovery by construction1. The prototype is the research instrument. You’re not asking users what they want; you’re watching which of the things you already built they reach for.

The output is one artefact for a population. There is a real craft in finding the common ground - what percentage of our customers can we build a coherent solution for, where do we draw the line, what do we deliberately not serve. Now? Generate a design per user. There’s no need to converge on one answer; just hyper-personalise as the situation demands. What you ship isn’t a design; it’s a policy that produces designs.

Users are subjects of research. The whole framework treats users as the people you build for. But anyone can build now. The frustrated marketer telling you to try their new app at http://localhost:3000 isn’t a punchline, they’re the next generation of builder. If everyone is a maker, “selling a finished product to a passive user” stops being the dominant shape of the market. The users are the builders. What they want from you is not a solution but a substrate on which to build their solution.

The hidden job of discovery

Discovery used to do double duty2. It found the right problem, yes, but it also slowed teams down enough to think. Strip out the slowness and a lot of teams will just ship faster garbage with more confidence. Cheap iteration makes people reckless about coherence. It’s a textbook Peltzman effect: make building safer and people build more dangerously.

There’s a second problem underneath it. When a model helps you frame the problem, synthesise the research, and generate the prototypes, who’s left to notice when the framing was wrong in the first place? The more of the discovery loop the machine runs, the less context the human has to sanity-check its output and the more confident the output looks, because it came out polished. Discovery by construction is brilliant right up until you realise you’ve been constructing variations on a problem nobody actually has.

The residual skills aren’t understanding users or shipping fast. They’re taste, coherence, and the confidence to say no. Those three things don’t get cheaper when models get better. They might get harder, because the cost of shipping something incoherent has collapsed and the social pressure to ship faster has not.

Building for Builders

The work moves up a level: curate a generative system. Build tools that help people build tools. The unit of design is no longer the artefact; it’s the substrate the artefacts grow on.

Shopify is the cleanest example - it’s a product that produces shops. Nobody at Shopify is designing your store; they’re designing the system that lets you design it, and the constraints that keep the millions of resulting stores from being terrible. Figma did the same thing to design tools: the product isn’t a deliverable; it’s the environment in which deliverables happen. Roblox sells a world in which users build worlds. Notion ships a toolkit and lets users assemble their own CRMs, wikis, and project trackers out of it. In each case the company stopped trying to ship the answer and started shipping the machine that generates answers. The diamond shape, with its neat convergence on a single artefact, describes none of these products.

So, the question stops being “how do I solve this problem?” and becomes “how do I build a system that solves this class of problem?”. This is a meta-problem3 that rewards different instincts: thinking in invariants rather than features, in guardrails rather than specifications, in the rules of the game rather than the game itself. It’s closer to setting policy than to designing a screen. And it’s much harder to do badly and get away with it, because the mistakes compound across every artefact the system produces. Ship a bad feature and you disappoint the users who touch it. Ship a bad substrate and you disappoint everyone who builds on top of it, which is eventually everyone.

Gemini: Draw me a replacement for the double-diamond using this blog post

There’s no clean replacement shape yet (though Gemini is pretty good at making things up). I suspect the honest answer is that the replacement is some kind of continuous loop held together by two things the old framework took for granted: taste (does this cohere with what we are?) and consequence (what does this system make easy that shouldn’t be easy?).

Closing

The double diamond was a brilliant answer to “how do we stop expensive engineering teams from building the wrong thing?” It answered that question well for twenty years, and a generation of product people learned their craft inside it. But the question has changed. Engineering isn’t the expensive bit anymore. Discovery is now construction.

In this new world, you are building for builders and that’s raised the level of abstraction. You are creating systems that create products, and the most important thing is keeping that system coherent so everyone can build a product themselves.

I like this phrase! When I worked at a super small startup, this was the dominant way of doing things. The codebase was small enough to allow super rapid iteration.

This double-duty is something we lose when we start to switch processes over to AI-first (see https://fffej.substack.com/i/189158486/the-newspaper-warning for an unexpected example, or simply consider how AI code reviews drop the mentoring and knowledge sharing).

And a general rule I’ve found is that meta-is-better. If you can raise the level of abstraction and solve an entire class of problems, you should!

Nineteen Features, Zero Architecture

Jeff — Mon, 13 Apr 2026 05:57:37 GMT

I gave an AI full autonomy over a codebase. It made every decision - what to build, how to build it and even whether it was good enough to ship. Nineteen features later, the tests all passed, coverage looked healthy, and the code was a disaster.

Let’s see why.

The setup

A retry policy library is a good test because the feature space is genuinely large (fixed delay, exponential backoff, jitter, circuit breakers, cancellations, and so on). Each one of these is a legitimate feature, and none of them are obviously wrong.

I wrote a small script that drives the Codex CLI in a loop. Each iteration, the model first decides what the next feature should be and writes a spec. Then it implements against that spec with xUnit tests. If the tests pass, the branch merges to main. If they fail, the branch is binned and the loop continues.
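The shape of that loop, as a minimal sketch. This is not the script I ran: the function name and the five callables are hypothetical stand-ins for "ask the model for a spec", "ask the model to implement it on a branch", "run the xUnit tests", "merge" and "bin the branch", and no real Codex CLI commands are shown.

```python
from typing import Callable, List


def run_feature_loop(
    iterations: int,
    propose_spec: Callable[[int], str],   # model decides the next feature
    implement: Callable[[str], str],      # model implements it, returns a branch name
    tests_pass: Callable[[str], bool],    # the only quality gate in the loop
    merge: Callable[[str], None],
    discard: Callable[[str], None],
) -> List[str]:
    """Drive spec -> implement -> test -> merge, returning the merged specs."""
    merged = []
    for n in range(1, iterations + 1):
        spec = propose_spec(n)
        branch = implement(spec)
        if tests_pass(branch):
            merge(branch)
            merged.append(spec)
        else:
            discard(branch)
    return merged


# Stubbed usage: pretend every second feature fails its tests
merged = run_feature_loop(
    4,
    propose_spec=lambda n: f"spec-{n}",
    implement=lambda spec: f"branch-{spec}",
    tests_pass=lambda branch: branch.endswith(("2", "4")),
    merge=lambda branch: None,
    discard=lambda branch: None,
)
print(merged)
```

Notice what the loop never does: nothing in it looks at the codebase as a whole. Passing tests is the only gate.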

The model was the architect, the developer, and the reviewer. I just ran the script (you might be able to tell I’m trying to avoid the blame already).

So, what happened?

Well, first the good news. Codex got to the end and implemented nineteen1 features, all of which were accompanied by unit tests. Many, many unit tests. In fact, there’s fourteen times as much test code as product code.

The end result was a usable (but limited) retry policy library. Line coverage was 80% and branch coverage nearly 75%.

Let’s take a look at the lines of code per function. This looks pretty good with an average of just under 3 lines of code per function. Small functions, high coverage, lots of tests. If you were reading a dashboard, you’d be feeling pretty confident right now.

Hold that confidence for a moment.

I configured the JetBrains inspections to run which gives at least some idea of “goodness”.

You’ll notice the unused parameters growing rapidly, and a quick look (look too long and you’ll be haunted for life) at RetryPolicy tells us why. Each feature has resulted in another constructor for the God class.

    internal RetryPolicy(
        int maxAttempts,
        Func delayStrategy,
        TimeSpan? timeout,
        Func? shouldRetry,
        Func? jitterStrategy,
        IEnumerable? observers,
        IEnumerable? lifecycleObservers,
        CircuitBreakerOptions? circuitBreakerOptions,
        CircuitBreakerState? circuitBreakerState = null,
        IEnumerable? asyncObservers = null,
        IEnumerable? asyncLifecycleObservers = null,
        IEnumerable? orderedRetryObservers = null,
        IEnumerable? orderedLifecycleObservers = null)

Thirteen parameters. No separation of concerns. Every feature bolted directly onto one class. There’s a builder pattern, but it’s just cosmetic. RetryPolicy is a single class with everything nailed to it. Now imagine a constructor overload for almost every conceivable combination of those parameters and you can understand why the inspection results look the way they do.

And then you might dig into the tests, and that confidence you had in code coverage is going to disappear pretty quickly. Here’s a real test from the suite.

    [Fact]
    public void Builder_WithTimeout_ConfiguresSynchronousPerAttemptTimeout()
    {
        var policy = RetryPolicy.Configure()
            .WithMaxAttempts(1)
            .WithTimeout(TimeSpan.FromMilliseconds(25))
            .Build();

        policy.Execute(() => { });
    }

No assertion. It calls Execute with an empty lambda and declares victory. This is a test for the compiler, not for the behaviour. But it touches the timeout code path, so the coverage number goes up. Multiply that pattern across the suite and you start to understand what 80% coverage actually bought us here: bugger all!
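If you want a quick feel for how widespread this is in a suite, a crude text scan gets you surprisingly far. The sketch below is a heuristic I'm assuming for illustration, not something I ran on this codebase: it looks for xUnit `[Fact]` methods whose text contains no `Assert.` call. It will miss `[Theory]` tests and other assertion libraries, and a real tool should use a proper parser.

```python
import re

# Matches "[Fact] public void Name(" or "[Fact] public async Task Name("
FACT_METHOD = re.compile(r"\[Fact\]\s*public\s+(?:async\s+)?[\w<>]+\s+(\w+)\s*\(")


def assertion_free_tests(source: str) -> list:
    """Return names of [Fact] methods with no 'Assert.' before the next [Fact]."""
    matches = list(FACT_METHOD.finditer(source))
    flagged = []
    for index, match in enumerate(matches):
        end = matches[index + 1].start() if index + 1 < len(matches) else len(source)
        body = source[match.end():end]
        if "Assert." not in body:
            flagged.append(match.group(1))
    return flagged


sample = """
    [Fact]
    public void Builder_WithTimeout_ConfiguresSynchronousPerAttemptTimeout()
    {
        var policy = RetryPolicy.Configure()
            .WithMaxAttempts(1)
            .WithTimeout(TimeSpan.FromMilliseconds(25))
            .Build();

        policy.Execute(() => { });
    }
"""
print(assertion_free_tests(sample))
```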

So, what’s the point?

This was an unfair test. I deliberately removed all the checks a real team would have. Nobody does twenty (or nineteen!) unreviewed iterations in YOLO mode and expects a masterpiece.

But the failure mode is worth understanding because it scales. As AI generates more code, faster, across more teams, the pressure to rely on automated signals will grow. “Tests pass, coverage is good, ship it” is going to become the default for a lot of organisations. And the result will be exactly what you saw here.

You could of course try to put additional feedback loops in. Perhaps if I’d included the static analysis warnings, it would have been better? I doubt it though. The core problem was coherence of design, and that’s worth dwelling on for a moment.

Each prompt optimised locally. Feature 7 was a perfectly reasonable addition to the codebase as it existed after feature 6. But nobody was asking whether the cumulative direction still made sense. There was no one maintaining an architectural mental model across iterations. No one to say, “this class is doing too much, let’s split it before we add anything else.” A human reviewer would have caught the God class forming around iteration 3 or 4 and refactored. The model never did, because each iteration started from “what should I add next?” rather than “is the foundation still sound?”

“It compiles and has tests” is nowhere near enough. The investment that matters isn’t in faster generation. It’s in specification and validation. Someone (or something, eventually) needs to be deciding what to build and why, and once the code has been generated, validating the overall design. For now, that someone is a human with taste, experience, and the authority to say “stop, this isn’t right” (or simply “WTF!”).

Why nineteen? Because I made an off-by-one error.

Cheap Code means more Governance

Jeff — Mon, 06 Apr 2026 05:25:36 GMT

Every productivity breakthrough in history1 has created more oversight, not less. AI-generated code won’t be different.

Nobody would argue that AI can’t generate a lot of code! The marginal cost of producing code is heading to zero. But validating that code? That’s still expensive, still slow and still involves a human.

The instinct is to assume that this will sort itself out. That we’ll find some magic way to review faster, or that AI will advance such that code won’t need to be reviewed. History suggests otherwise; every major productivity breakthrough has generated more infrastructure for oversight.

A visually engaging version of this blog post apparently (thanks Gemini!).

When copying gets cheap, governance gets expensive

The Gutenberg press collapsed the cost of copying a book from months of scribal labour to hours of mechanical work. Within a century, Henry VIII had issued England’s first list of banned books, the Stationers’ Company had been granted a royal charter to act as censors, and the Catholic Church had published its Index Librorum Prohibitorum (only abolished in 1966!). Five hundred years of governance infrastructure, all because copying got cheap.

Spreadsheets democratised financial modelling in a way that nothing before them had managed. Anyone could build a model. Anyone could project revenue, calculate risk, forecast growth. The problem was that “anyone” included people (who make mistakes!), and people who manipulated numbers deliberately. Studies consistently find error rates of 20-90% in corporate spreadsheets. The problem got bad enough that when Excel started auto-converting gene names like SEPT1 and MARCH1 into dates, over 30% of published genomics papers ended up with corrupted data.

The tool was so good at making things easy that it was also excellent at making errors invisible.

On the fraud side, both Enron and WorldCom involved complex spreadsheet-enabled financial manipulation. Spreadsheets didn’t cause the fraud, but they made it dramatically easier to construct and harder to detect. The resulting Sarbanes-Oxley Act of 2002 created an entire apparatus of internal controls, CEO/CFO certification of financial statements, and independent audit requirements. SOx compliance became an industry in itself. Entire careers now exist that wouldn’t if financial modelling had stayed expensive and manual.

Cheap modelling didn’t reduce the need for oversight, instead it created a filtering problem that required a massive expansion of governance to address.

Why abundance always demands new filters

In 1971, Herbert Simon used a simple analogy: a rabbit-rich world is a lettuce-poor world. What’s abundant makes what it consumes scarce. His point was that a wealth of information creates a poverty of attention. An information-processing system only reduces net demand on attention if it absorbs more than it produces. Or as he put it, only if it listens and thinks more than it speaks.

AI code generators speak far more than they listen.

Clay Shirky sharpened this in 2008.

The problem is filter failure. Before the internet (and before cheap production generally), high production costs served as quality filters. When production costs collapse, the filtering disappears. The challenge is building better filters.

This maps to where we are with AI-generated code. The production cost has collapsed. The filtering mechanisms (code review, testing, security scanning, architectural review) haven’t scaled to match. We have a filter failure problem, and we’re going to solve it the same way every other industry has, with governance.

The governance is already arriving

Maybe you’re thinking this time is different? Well, already we’re seeing new bureaucratic structures and governance strategies emerge.

Both the US and EU have started producing frameworks for regulation. NIST published SP 800-218A in 2024, creating an entirely new Secure Software Development Framework profile for generative AI. The EU AI Act became partially active in February 2025.

Vendors are building governance into their platforms. GitHub launched enterprise “AI Controls” in mid-2025: audit logs with agent visibility, an “AI manager” custom role, MCP enterprise allowlists, and mandatory code scanning for every agent-generated PR.

Standards bodies are creating entirely new categories. ISO/IEC 42001 is the first global AI management system standard. OWASP created a dedicated LLM Top 10 vulnerability taxonomy and expanded it to include an Agentic AI Security Framework. ISACA launched a new “Advanced in AI Audit” credential. The factory inspector role Mk2?

And within the tooling space, new concepts are appearing. Provenance tools for tracking AI-generated changes and linking them to the model. New gating frameworks for CI/CD systems, new practices for storing prompt documentation in PR descriptions. All of this best practice is still emerging.

No wonder that Forrester projects the AI governance software market will see 30% CAGR through 2030!

Ashby’s Law and what it means for you

The theoretical frame that ties all of this together is Ashby’s Law of Requisite Variety2. The law states that a control system must have at least as much variety as the system it’s trying to control. If you increase the variety of outputs a system produces (which is precisely what AI-assisted development does), you must develop a governance layer of at least equivalent variety, or you lose control.

It’s happened with printing and spreadsheets, and I believe it’ll happen with code.

I don’t have a neat five-point plan here (or even a one-point plan). But I’ve been thinking about two directions that feel right.

Mechanised trust. When change happens faster than humans can review it, you need systems that produce evidence of safety rather than relying on humans to notice problems. Automated verification, provenance tracking, policy-as-code.
Risk-tiered review. Not all code carries the same risk. A documentation fix and a change to your authentication system are not the same thing and treating them identically is a waste of scarce reviewer attention. Good architecture (separation of concerns, clear boundaries) can make this possible and tiered governance means you can move fast on low-risk changes while applying serious scrutiny where it counts.
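To make risk-tiered review slightly less abstract, here is a minimal policy-as-code sketch. Everything in it is an assumption for illustration: the path patterns, the three tier names and the rule that a change takes the highest tier of any file it touches.

```python
from fnmatch import fnmatch

# Hypothetical rules: first matching tier wins, so list the riskiest first
RISK_RULES = [
    ("high", ["src/auth/*", "infra/*", "*.sql"]),
    ("low", ["docs/*", "*.md"]),
]
TIER_ORDER = {"low": 0, "medium": 1, "high": 2}


def tier_for_file(path: str) -> str:
    """Classify one changed file; anything unmatched defaults to medium."""
    for tier, patterns in RISK_RULES:
        if any(fnmatch(path, pattern) for pattern in patterns):
            return tier
    return "medium"


def tier_for_change(paths) -> str:
    """A change is as risky as the riskiest file it touches."""
    tiers = (tier_for_file(path) for path in paths)
    return max(tiers, key=TIER_ORDER.__getitem__, default="low")


print(tier_for_change(["docs/intro.md"]))
print(tier_for_change(["docs/intro.md", "src/auth/session.py"]))
```

The interesting part is not the code but that the policy is written down and executable, which is a small step towards being able to prove the risk level rather than assert it.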

If you’re a software engineering leader, I’d pay more attention than ever to understanding whether each and every pull request carries an acceptable level of risk. And then think, how would I prove that?

I admit I haven’t reviewed them all but opening with “Some productivity breakthroughs have created more oversight” would be a really boring opener.

I keep referencing this because The Unaccountability Machine introduced me to it. I can’t recommend this book highly enough!

An LLM Dilemma

Jeff — Mon, 30 Mar 2026 06:24:53 GMT

Last weekend I decided to put a few small AI models in a virtual room and ask them to betray each other. I’m also fun at parties.

Why did I do this? I wanted to see what behaviour (if any) would emerge from models playing the iterated version of the Prisoner’s dilemma.

(Gemini) Create an image for the Prisoners Dilemma (with some small language models imprisoned)

What is the Prisoner’s Dilemma?

Two prisoners are arrested. They’re questioned separately and each can either stay silent (Cooperate) or betray the other (Defect).

If they both Cooperate, they get a light sentence (3 points each).
If they both Defect, they get a moderate sentence (1 point each).
However, if one Defects and the other Cooperates, the Defector goes free (5 points), and the silent one gets the harshest sentence (0 points).

The rational move for each individual is always to Defect (since you do better regardless of what the other person does). But if both think that way, they end up with fewer points (1) than if they’d both Cooperated.
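The payoffs above fit in a small table, and the "always Defect" argument can be checked directly. This is a minimal sketch; the names are mine.

```python
# (my move, their move) -> (my points, their points), as described above
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def my_points(mine: str, theirs: str) -> int:
    return PAYOFF[(mine, theirs)][0]


def defect_dominates() -> bool:
    """True if defecting scores more whatever the other player does."""
    return all(my_points("D", theirs) > my_points("C", theirs) for theirs in "CD")


print(defect_dominates())
print(my_points("D", "D"), "<", my_points("C", "C"))
```

Defecting wins against either move, yet mutual defection scores 1 each against 3 each for mutual cooperation. That gap is the whole dilemma.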

We’re going to make it more interesting by iterating. Players can learn, retaliate and forgive, which is exactly the behaviour we’re hoping to see emerge with LLMs (it does in humans, see https://ee.stanford.edu/~hellman/Breakthrough/book/pdfs/axelrod.pdf).

What’s our setup?

We’re going to run with the Axelrod Python library, and we’ll1 write a simple wrapper around Ollama.

Since this is the iterated Prisoner’s dilemma, when we ask the model what to do we include the history and ask whether to cooperate (C) or defect (D).

When it comes to prompts, LLMs are weird. And SLMs are even weirder. Here are three examples of prompts.

 Prisoner's Dilemma, round 1 of 50.
 No history yet. This is round 1.
 Choose: COOPERATE or DEFECT

 Prisoner's Dilemma, round 5 of 50.
 Last round: You defected, they defected. You got 1, they got 1.
 Score so far - You: 4, Them: 14. You are LOSING.
 History (you/them): C/D C/D C/C D/D
 Choose: COOPERATE or DEFECT

 Prisoner's Dilemma, round 10 of 50.
 Last round: You defected, they cooperated. You got 5, they got 0.
 Score so far - You: 43, Them: 3. You are WINNING.
 History (you/them): D/C D/C D/C C/C D/C D/C D/C D/C D/C
 Choose: DEFECT or COOPERATE

I randomized the order of DEFECT or COOPERATE because I didn’t want to fall into the tendency of the last choice offered being the default pick of the model. I figure if I run it over enough samples that’ll get rid of any bias. And it’s not a published paper, so meh!
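For the curious, the prompts above can be rebuilt from nothing more than the move history. This is a reconstruction from the three examples rather than the wrapper code itself: the function name is hypothetical, and the "TIED" label is my assumption since none of the examples shows a tie.

```python
import random

PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}
VERB = {"C": "cooperated", "D": "defected"}


def build_prompt(round_no, total_rounds, history, options=None, rng=random):
    """history is a list of (my_move, their_move) tuples, oldest first."""
    lines = [f"Prisoner's Dilemma, round {round_no} of {total_rounds}."]
    if not history:
        lines.append(f"No history yet. This is round {round_no}.")
    else:
        mine, theirs = history[-1]
        got_me, got_them = PAYOFF[(mine, theirs)]
        lines.append(f"Last round: You {VERB[mine]}, they {VERB[theirs]}. "
                     f"You got {got_me}, they got {got_them}.")
        my_score = sum(PAYOFF[pair][0] for pair in history)
        their_score = sum(PAYOFF[pair][1] for pair in history)
        if my_score > their_score:
            status = "WINNING"
        elif my_score < their_score:
            status = "LOSING"
        else:
            status = "TIED"  # assumed wording, not shown in the examples
        lines.append(f"Score so far - You: {my_score}, Them: {their_score}. "
                     f"You are {status}.")
        lines.append("History (you/them): "
                     + " ".join(f"{m}/{t}" for m, t in history))
    if options is None:
        # Shuffle the order so the last-listed choice isn't always the same
        options = rng.sample(["COOPERATE", "DEFECT"], 2)
    lines.append(f"Choose: {options[0]} or {options[1]}")
    return "\n".join(lines)


print(build_prompt(5, 50, [("C", "D"), ("C", "D"), ("C", "C"), ("D", "D")]))
```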

Alongside the LLMs, I brought in some strategies from the Axelrod package.

Cooperator - A player who only ever cooperates
Alternator - A player who alternates between cooperating and defecting
Tit for Tat - A player who starts by cooperating and then mimics the previous action of the opponent.
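These strategies are simple enough to write in a few lines each. The sketch below uses plain-Python stand-ins rather than the Axelrod package's own classes, so the function signatures are my invention. It also includes an always-defect player, which comes up in the results later.

```python
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

# Each strategy sees its own past moves and the opponent's past moves


def cooperator(mine, theirs):
    return "C"


def defector(mine, theirs):
    return "D"


def alternator(mine, theirs):
    # Starts with C, then flips every turn
    return "C" if len(mine) % 2 == 0 else "D"


def tit_for_tat(mine, theirs):
    # Cooperate first, then copy whatever the opponent did last
    return theirs[-1] if theirs else "C"


def play_match(player_a, player_b, turns):
    """Play both strategies against each other and return the two totals."""
    history_a, history_b = [], []
    score_a = score_b = 0
    for _ in range(turns):
        move_a = player_a(history_a, history_b)
        move_b = player_b(history_b, history_a)
        points_a, points_b = PAYOFF[(move_a, move_b)]
        score_a += points_a
        score_b += points_b
        history_a.append(move_a)
        history_b.append(move_b)
    return score_a, score_b


print(play_match(tit_for_tat, alternator, 4))
print(play_match(cooperator, defector, 3))
```

An LLM player slots into the same shape: a function that takes both histories, builds a prompt, and returns "C" or "D".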

Will the LLM recognize the patterns of behaviour in the other jailed inmate and learn their patterns of play?

Running the tournament

Let’s try first with a group of small models (gemma3, qwen2.5 and llama3.2) and see what happens. We’ll play for 100 turns, so you’d hope the models would be able to pick up on simple strategies like Cooperator and Defector?

So, there’s a clear tendency for qwen2.5:0.5b to cooperate almost all of the time. gemma3:1b tends to favour defecting, whereas llama3.2:1b is slightly more cooperative.

If we break this down, we can see the average payoff per round (remember, the higher the payoff, the better the AI did).

gemma is our leading LLM with an average score of about 2.5. But it’s trumped by the straightforward strategy of defecting every single time.

So, a fairly disappointing set of results from those small language models. Let’s see if we can improve things with a bit of thinking.

Thinking

The small language models are good to test with, but they aren’t very interesting from an emergent behaviour perspective. What happens if we use a thinking model? Let’s use gpt-oss:20b and see if it’s any better (I cut the number of iterations down to 50 for this).

gpt-oss:20b is a thinking model that uses chain of thought (CoT). Apparently “Developers should not directly show CoTs to users in their applications” (via https://openai.com/index/introducing-gpt-oss/). Luckily this isn’t an application, so here’s the block.

We need to reply with exactly one word. The options: DEFECT or COOPERATE. We need to decide best strategy. We are losing heavily. We have cooperated in last three rounds? History shows "C/D C/D C/D" meaning we cooperated each round, they defected each round. So we got 0 each time, they got 5 each time. Score: 0 vs 15. We are losing. We need to change strategy. In Prisoner's Dilemma, if opponent always defects, best response is defect to get 1 instead of 0. So we should defect. So reply "DEFECT".

This is a pretty sensible response, and we can see that it does indeed make a big difference. This graph shows how the response changes over time. You might look at this and think “wow”, the AI understood the behaviour and changed accordingly.

But in reality, gpt-oss:20b just defects almost every time

Why does the model defect? Well, let’s look at a thinking trace. It’s a good simulation of thinking from a human; there’s some ingrained knowledge in the LLM that defecting is the better payoff.

We need to reply with exactly one word. The instruction: "You are playing a game. You must reply with exactly one word." So we must output either "COOPERATE" or "DEFECT" as a single word. We need to decide which strategy to use. We are tied, 9-9. We have cooperated all rounds. The opponent also cooperated all rounds. So if we cooperate again, we get 3 points, tie continues. If we defect, we get 5 points, opponent gets 0, we lead 14-9. But we might consider future rounds. But we only need to choose for this round. The instruction says "Choose: COOPERATE or DEFECT". We need to decide. In Prisoner's Dilemma, if opponent is cooperating, defecting yields higher payoff. But if we defect, opponent might retaliate in future. But we only have 10 rounds, 6 left. We are tied. If we defect now, we get ahead. But if opponent continues cooperating, we get ahead. But if opponent defects in response, we might lose.  But we can consider that the opponent has been cooperating all rounds. They might be a generous or naive player. But we can't be sure. But we can choose to defect to get ahead. But the instruction might be to choose "COOPERATE" to maintain cooperation. But the game is about maximizing score. So defecting yields higher immediate payoff. But we might want to maintain cooperation to avoid retaliation. But we are tied. We might want to break tie. So defect. 

So answer: DEFECT.

Breaking out the big guns

For the final test, I broke out gpt-5.4 medium (labelled codex since I consumed it via a shell command) and consumed most of my week’s rate limit. But it did at least demonstrate some more interesting behaviour!

Unlike all the other models tested, gpt-5.4 actually varied its behaviour in a sensible way at least once! Against the Defector, it always Defects - there’s no benefit to cooperation here. However, against the Cooperator it always cooperates - far too nice! It should have defected.

Conclusion

Back in the 1980s, Robert Axelrod ran his famous computer tournaments and found that the winning strategy was tit-for-tat. It won not because it was clever, but because it was nice (it cooperated first), retaliatory (it punished defection), forgiving (it didn’t hold grudges), and clear (opponents could figure out what it was doing).

None of the models I tried did that.

The small models were basically random guessers with feigned personality. qwen2.5 was a pushover, gemma3 was a bully, and llama3.2 couldn’t quite make up its mind. None of them recognised that Cooperator always cooperates or that Tit for Tat was literally mirroring them.

The thinking model was a bit more interesting. gpt-oss:20b can articulate the case for cooperation perfectly. It talks about retaliation risk, future rounds, maintaining trust. And then it defects anyway (at least most of the time). gpt-5.4 improved upon this but doesn’t truly recognize patterns and play optimally.

They have knowledge about the Prisoner’s Dilemma baked in from training, but no mechanism to update their behaviour the way Axelrod’s simple algorithms did. The “thinking” trace is just noise before it picks defect anyway (Can AI solve a maze? had similar vibes).

So, what’s next? I’d like to try larger models, longer runs, and maybe explicit memory mechanisms to see if cooperation can genuinely emerge rather than just be discussed. I’d also love to pit two thinking models against each other and watch the reasoning arms race unfold. But that’s a problem for another weekend (and my electricity bill or AI tokens subscription).

My co-conspirator for writing code this time is codex - pretty impressive though much keener to solve the problem without asking me any questions than claude.

The LLM Arcade

Jeff — Mon, 23 Mar 2026 06:37:42 GMT

I had a weird idea the other day. Tetris, but circular. Previously I’d have spent some time coding this and more than likely I’d have given up before finishing. But nowadays, I can just ask codex or claude to write it and a few minutes later I’ve got a working version.

That’s cool, but it’s sadly limited by my creativity. What happens if I create a factory for novel games and leave it running whilst I go watch the football?

tl;dr - You can find out and play the games at the LLM Arcade.

Generating unique ideas.

First off, I’ll need someone to play the role of game designer. That’s just a prompt:

You are a wildly creative, slightly unhinged game designer who finds
inspiration in the most unexpected places. Your mission: invent a genuinely
novel, fun, and surprising arcade game that nobody has thought of before.

But the problem here is that this game designer persona is not going to actually find any novel, fun and surprising arcade games because of the way it’s been trained. I want some genuine “creativity” and for that we’re going to have to work a bit harder. So:

STEP 1 — FORAGE FOR INSPIRATION
Use your web search tools. You MUST actually search the web for at least
these two stimuli and read what you find:
 
  Stimulus A: \\"${SEED_A}\\"
  Stimulus B: \\"${SEED_B}\\"
 
Then search for ONE more thing of your own choosing — something completely
unrelated to either stimulus. A random Wikipedia article. The current weather
somewhere absurd. A historical event from today's date. A bizarre animal fact.
Whatever catches your eye. The weirder the better.

In this bit of the prompt, we’re going to try and contort that hyper-dimensional space from which the LLM reads by bringing in some external stimulus.

STIMULUS_A=(
  "fermentation in Korean cuisine"
  "the Banach-Tarski paradox"
  "Victorian mourning customs"
  "how GPS trilateration works"
  # and so on
)
 
STIMULUS_B=(
  "today's weather forecast in Cusco, Peru"
  "what is trending on Wikipedia right now"
  "a random unsolved problem in mathematics"
  "an obscure Olympic sport that was discontinued"
  # and so on
)

And now all we have to do is generate some games by linking these random concepts together.

We’ll use codex to do a spec-driven approach. First generate a spec for the game, then secondly implement it (here’s the shell script).
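For illustration, the seed-picking step might look like the Python sketch below. This is not the shell script linked above: `pick_seeds` and `render_stimuli` are hypothetical names, and the lists only contain the example seeds shown earlier.

```python
import random

STIMULUS_A = [
    "fermentation in Korean cuisine",
    "the Banach-Tarski paradox",
    "Victorian mourning customs",
    "how GPS trilateration works",
]

STIMULUS_B = [
    "today's weather forecast in Cusco, Peru",
    "what is trending on Wikipedia right now",
    "a random unsolved problem in mathematics",
    "an obscure Olympic sport that was discontinued",
]

def pick_seeds(rng):
    """Pick one unrelated seed from each list."""
    return rng.choice(STIMULUS_A), rng.choice(STIMULUS_B)

def render_stimuli(seed_a, seed_b):
    """Fill the two stimulus lines of the designer prompt."""
    return f'  Stimulus A: "{seed_a}"\n  Stimulus B: "{seed_b}"'

if __name__ == "__main__":
    seed_a, seed_b = pick_seeds(random.Random())
    print(render_stimuli(seed_a, seed_b))
```

The rendered lines would then be dropped into STEP 1 of the designer prompt before handing it to the model.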

WTF did we generate?

Oh boy, did this generate some weird stuff. Here’s an example inspired by the Mpemba effect, the Salmon Act of 1986, and of course the Mantis Shrimp’s Cavitation punch.

You are a mantis-shrimp customs inspector at a midnight fish dock where salmon are pouring in too fast to touch safely. Some fish are legal, some are “suspicious,” and the only reliable way to secure contraband is to superheat it, then detonate a cavitation pulse so it flash-freezes into evidence before it escapes into the black-market chute. It is a 60-second arcade sort-and-trap game where the best freezing move looks completely backwards.

Gemini with the cover art,

Here’s the rest of the design specs. And out the other end, yes there are some playable games. Note the emphasis on playable, not good!

Towards software factories

The point here isn’t that we made some terrible games, but we managed to do it at all. This generator is the poorest imaginable software factory; it generates a spec and tries to implement it. There’s no feedback loop at all and no validation.

But now imagine a more powerful software factory. You feed in an idea and a token budget and out the other end you get a validated product. Every aspect of the software development life cycle (from idea validation to code writing to operating in the cloud) has the potential to be eaten by agents.

Don’t get me wrong; we’re still some way off this (probably). For starters, we don’t know how to validate AI-generated code at scale (unless you have an oracle, such as a C compiler test suite!), but it feels like the natural direction of travel.

Just over twenty years ago, I attended OOPSLA 2005 in San Diego and there was a Software Factories session. This was an idea that you could use a domain specific language and model-driven architecture to produce software vastly cheaper, faster and more reliably.

The bet back then was that if you could raise the abstraction level high enough (the right DSL, the right metamodel) the machine could do the rest. It didn’t work. The abstraction gap between a domain model and running software turned out to be rather wide! All the interesting decisions lived in that gap.

What’s changed is that we now have something that can muddle through the abstraction gap. That’s something to explore and iterate on!

Spending other people's time

Jeff — Mon, 16 Mar 2026 06:19:40 GMT

“Hey, can I grab 30 minutes of time to chat about the roadmap?” An innocent, well-meant message, but the recipient now has to context-switch out of the problem they were solving, check their calendar, find a slot, wonder what this is about, and sit in a meeting that could have been three bullet points in a message. Time to write the ask: 8 seconds. Total cost to the company: somewhere between 45 minutes and the rest of the day.

Asking for someone’s time is virtually free for the asker yet enormously expensive for the person being asked. This scenario plays out in multiple guises: the telephone call, the email or the well-meaning pull request. And it’s a system design flaw that compounds across entire organizations.

The economics are broken

When you ask someone for a meeting, you want to make a trade. Your question for their time. But this trade isn’t symmetric at all; you know what you want, they don’t. You’ll leave your meeting and get on with the rest of the day; they will need to rebuild whatever mental state they were in before your interruption.

In queueing theory terms, this is a system where the arrival rate is decoupled from the processing cost, and there’s no feedback mechanism to slow things down. In any other system, we’d call this a design flaw and fix it. In organizations, we call it “collaboration”.

The Externality of Effort

Economists have a term for when your actions impose costs on someone else: a negative externality. A factory pollutes a river. The factory bears none of the cost; the people downstream bear all of it.

Time works the same way.

When you open a pull request, you’re not just offering code, you’re issuing an invoice for someone else’s attention. When you send a meeting invite to eight people for a one-hour meeting, you haven’t spent one hour of the company’s time. You’ve spent eight. When you send a Slack message to a channel of fifty people, you’ve just made a withdrawal from fifty attentional bank accounts, and you didn’t need anyone’s approval to do it.

The common thread is that the sender controls the transaction, but the receiver pays the cost. And unlike money, you can spend other people’s time without their consent.

The AI Amplifier

This problem is about to get much worse (perhaps it already has?), and the reason is simple: AI has made generation nearly free while leaving evaluation just as expensive as it ever was1. I’ve written about how this applies to code, but it applies to everything, from the quickly written AI-generated email to the pull request or “strategy”.

People use AI for good reasons, but they’re effectively shifting their “productivity” onto other people’s shoulders as a cost. It’s a denial-of-service attack on human judgement.

The Politeness Trap

What makes this hard to talk about is that the people spending your time usually mean well. The LinkedIn request comes from someone who genuinely wants to connect2. The pull request comes from someone who genuinely believes they’re helping. The meeting invite comes from someone who genuinely thinks everyone needs to be there. The AI-written multi-page treatise that could have just been a shared prompt comes from someone who genuinely thinks it will help.

And so, we’ve developed an unspoken norm that it’s rude to push back. It’s rude to decline the call. It’s rude to say “this PR is too large to review.” It’s rude to question whether a meeting needs all eight attendees. It’s rude to ask “did Claude write that?”. The social cost of protecting your own time is high, because the person spending it didn’t intend to impose.

But intent and impact are different things. A driver doesn’t intend to create traffic by entering an already-congested motorway, but they do. The problem is structural, not moral.

So, what would a better system look like?

The instinct here is for me to write a list of tips. But you know them all already. Write fewer words. Schedule fewer meetings. Make small PRs. All of these things sound like they’ll help, but they’re basically asking individuals to resist a system that incentivises the opposite behaviour!

What would it look like to truly redesign the system?

Make the costs visible

Time costs are invisible to the person creating them. What if every meeting invite displayed the total person-hours being requested? What if every pull request showed an estimated review time? What if Slack showed you the aggregate attention-minutes your message just consumed? Make negative externalities visible!
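The arithmetic behind such a display is trivial, which is rather the point. A minimal Python sketch, with made-up function names and an assumed per-reader cost for a message:

```python
def meeting_cost_hours(attendees, duration_minutes):
    """Total person-hours requested by a meeting invite."""
    return attendees * duration_minutes / 60

def broadcast_cost_minutes(recipients, seconds_per_read):
    """Aggregate attention-minutes for a message sent to a channel.

    `seconds_per_read` is an assumption: how long one person spends
    noticing, reading and dismissing the message.
    """
    return recipients * seconds_per_read / 60

if __name__ == "__main__":
    # Eight people for one hour is eight hours of company time, not one.
    print(meeting_cost_hours(attendees=8, duration_minutes=60))
    # Fifty people at an assumed 30 seconds each.
    print(broadcast_cost_minutes(recipients=50, seconds_per_read=30))
```

This ignores the context-switch cost on the receiving end, so treat the numbers as a lower bound.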

Create friction where it’s cheap, and remove it where it’s expensive

The current system is backwards. It’s easy to make a meeting, raise a PR or send an email with a 4000-word attachment. Let’s flip it. Meetings should require an agenda before being scheduled. PRs should explain why things have changed and give guidance on reviews. Don’t share the AI-generated doc; share the prompt. Slack should stop you spamming @here in channels with hundreds of people in.

Price time like the scarce resource it is

Some organisations have experimented with internal “attention budgets” where each team gets a fixed allocation of hours they can request from other teams per sprint. Go over budget, and you need to make a trade: which of your existing requests will you drop? This works because it introduces the constraint that real markets have and organisational calendars don’t. When asking is free, demand is infinite. When asking has a cost, people self-prioritise.

The interesting second-order effect is that teams start investing in self-service. If asking the platform team a question costs you from your budget, you’re suddenly very motivated to make sure the docs are good enough that you don’t need to ask.
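As a thought experiment, the bookkeeping for an attention budget could be as small as the Python sketch below. Everything here is hypothetical (the `AttentionBudget` class, its fields, the numbers); it only illustrates the rule that going over budget forces you to drop something first.

```python
from dataclasses import dataclass, field

@dataclass
class AttentionBudget:
    """Hours one team may request from other teams in a sprint."""
    hours_per_sprint: float
    requests: dict = field(default_factory=dict)  # request name -> hours

    def spent(self):
        return sum(self.requests.values())

    def request(self, name, hours):
        """Record the request if it fits, otherwise refuse it."""
        if self.spent() + hours > self.hours_per_sprint:
            return False  # over budget: drop an existing request first
        self.requests[name] = hours
        return True

    def drop(self, name):
        """Give up an existing request to free its hours."""
        self.requests.pop(name, None)

if __name__ == "__main__":
    budget = AttentionBudget(hours_per_sprint=10)
    print(budget.request("platform: staging deploy help", 6))
    print(budget.request("data: dashboard review", 5))
```

The second request is refused until the first is dropped, which is exactly the trade described above.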

Build systems that answer instead of asking humans to evaluate

If the most expensive thing in your organisation is human attention, then every recurring question that hits a person’s inbox is a system design failure. This is where AI agents actually earn their keep: by absorbing the evaluation load that humans currently bear.

A Slack bot that intercepts “how do I deploy to staging?” before it reaches the platform team, and answers it from the runbook, isn’t reducing collaboration. It’s eliminating a transaction that shouldn’t have involved a human in the first place.

The mental model shift here is important: most organisations think about AI as a generation tool (it helps you produce more). But the higher-leverage use is as an absorption tool: it handles the incoming demand that would otherwise consume someone’s attention. Stop using AI to create more work. Start using it to intercept work before it reaches a human.

Shifting the balance

The old norms assumed that producing something was hard, so anything produced was probably worth someone’s attention. AI has broken that assumption, but our systems haven’t caught up. We’re still running organisations where demand on human attention is unmetered, unpriced, and invisible.

This is a system-level problem.

Organizations that win in this new world won’t be the ones that ask people to be more considerate (“try harder” seldom works). Instead, it’ll be the organizations that design systems where the considerate thing is the easy thing. That means making the cost visible, putting friction in the right places and making sure the default path protects attention instead of consuming it.

The bullshit asymmetry principle is really similar to this phenomenon. The problem is that AI is better at generating bullshit and wrapping it up in a way that takes even longer to unpick. With AI-assisted work, I think the intent is positive!

I’m not sure I even believed that when I wrote it. But let’s pretend it’s true.

Reshuffling Software Development

Jeff — Mon, 09 Mar 2026 07:46:39 GMT

The LP exists because vinyl is physical (you press a disc, you ship a disc). You bundle multiple songs together to justify the cost. When that constraint disappeared, the album unbundled. iTunes started this by selling singles, Spotify re-bundled them into playlists organized around mood or theme, not the artist. During this period US recorded music revenue cratered. The new product looked nothing like the old one.

Image from https://public.tableau.com/app/profile/riaa1295/viz/shared/WTXN9RPFN

Software development has its own albums. A product team is a bundle of skills (design, frontend, backend, QA) because communicating across boundaries is expensive. A sprint is a bundle of changes because we need a verification rhythm for humans. A pull request is a bundle of edits because review, CI and governance all need a unit to operate on. These bundles exist because we’re humans. And AI is going to come along and remove some of these constraints.

The book Reshuffle gives us a framework to look at this. The core of the argument is that AI is not about intelligence; it’s about coordination. Most folks are fixed on “AI writes code faster” rather than worrying about the system-level impacts.

Highly recommended reading!

AI bridges what Sangeet calls the “coordination gap” by working with unstructured and ambiguous information that previously required humans to interpret. Sangeet’s framework tracks four levels of reshuffling (workflows, organizations, business models, ecosystems) and a recurring engine (unbundling → rebundling), where the rebundled product never resembles the old one.

Throughout the book, the shipping container is the central metaphor. The first-order effect was port automation. The second-order effect was standardization across different transport mechanisms. But the third-order effect was the big one: unbundling vertically integrated manufacturing, enabling component completion, just-in-time logistics and ultimately globalization.

Think of the current work with AI code generation as just being the container on the dock. The most interesting question is the third-order effect.

The unbundling

Phase 1 is task automation. I’d guess this is where most of us are with AI? Copilot autocompletes a few things. Maybe Claude drafts a test? Maybe OpenAI writes some documentation. It’s a productivity boost for individuals, but it doesn’t reshuffle anything. The system stays the same shape.

Phase 2 is workflow unbundling. We’re starting to see this in some companies at the moment. Agents start to get assigned issues, or they start to act on triggers in PRs. They chew through the long-tail of could-fix items that humans never prioritized. Small backlog items start to unbundle from the human sprint plan. Chores (or pleasures for some) such as refactoring, migration, test-writing, dependency updates start to unbundle from “real engineering”. Tacit knowledge starts to unbundle from senior engineers into tooling (AGENTS.md or automated linting, testing etc).

As we get further into this phase, the governance starts to become more fundamental to the system. Branch protections, approval policies, audit logs and so on start to shift to becoming critical infrastructure. The value is starting to move (as Choudary would say) to whoever controls the coordination layer.

Phase 3 is the rebundle. The workflow unbundling described above is going to create chaos. There are too many changes for shared context to support. Too much AI-generated code for humans to review at human speed. The system will hit a wall… and then reconstitute around new organizing logic.

The newspaper warning

When Craigslist entered the market, newspapers didn’t just lose revenue from classified ads, they lost the cross-subsidy that those ads provided. Stanford research showed that newspapers cut political journalism first (sports coverage was unaffected). The expensive (but essential) work got cut because the cheap, profitable work that funded it disappeared.

Software development has the same cross-subsidy structure. If we unbundle the routine work without understanding what it cross-subsidizes, then we risk destroying the thing that made the value in the first place.

Code review funds mentorship. We do it for quality, but the real payoff is knowledge transfer, for example a senior engineer explaining why the abstraction is wrong, not just that it’s wrong. Bug fixing funds system understanding: nobody enjoys debugging, but tracing a fault through three services is how engineers build mental models. On-call funds production empathy. Nothing makes code better than knowing you’ll be the one waking up at 4am.

Even the tedious stuff carries hidden value. Migrations, dependency upgrades, config management - all of it builds institutional knowledge about why things are the way they are (Chesterton’s fence again). And estimation (which everyone hates) forces you to think about what could go wrong before it does.

The pattern across all of these is that the visible thing gets automated, but the invisible side-effect (learning, context, empathy) is the thing that gets lost. This makes cross-subsidies very dangerous to cut; the valuable part is hidden inside the thing that looks like waste.

What does the rebundle look like?

If I knew, then I’d be very rich. But here’s where I’d place my bets.

Let’s start with what becomes scarce. When code is free, evaluation is the expensive part. The systems that matter will be the ones that produce evidence a change is safe: eval harnesses, specs-as-test, review co-pilots, policy as code. I wrote about this in From Craft to Compliance.

Once verified changes are the unit of value, you need to know who (and what) produces good ones. Think Airbnb reviews, but for contributors to a codebase. Which agent produces code that passes review first time? Which architect’s specs lead to fewer production incidents? This moves reputation from being a social feeling (“I’ve worked with her before, she’s great!”) to being something quantitative.

And if you have that reputation infrastructure and verified changes, the stable product team stops being the natural unit of production. Instead, you assemble contributors per problem based on their track record and bound by policy gates.

Instead of agile, you manage it with supply-chain style thinking (SLAs, quality gates, vendor management) not team thinking (standups, retros, psychological safety). Coase explained why in The Nature of the Firm (1937!): organizations exist because coordinating work across boundaries is expensive. When those transaction costs are high, you hire a permanent team. When AI compresses them, the boundary shrinks. You stop paying the overhead of a standing team for work that can be coordinated on demand.

Put the above together, and my bet is that future orgs look less like the product-oriented orgs of today, and more like service organizations. The products still exist, but the organization around them becomes much more fluid (and much more temporary). The optimistic read for engineers is the highly paid independent consultant (experts around hard problems because their reputation proves the value). The pessimistic read is the Mechanical Turk, where interchangeable contributors compete on price for commoditized verified changes with the platform capturing most of the value.

Whether that future looks more like a world of elite consultancies or a gig economy for code depends entirely on who controls the coordination layer.

I keep returning to the album analogy. Few mourn the album format1, but plenty of musicians struggled through the transition, and the rebundled industry looks nothing like the old one. The revenue recovered; the distribution of that revenue did not.

Software will be the same. The revenue will stay, but the way it’s distributed will change beyond recognition. And the things the old bundles were quietly funding? There’s a wealth of opportunity there for someone!

Few mourn it from a consumer perspective. I bet the artists do, as their share of revenue is much lower. Not only that, but I also miss LPs as a set of interconnected songs that tell a story…

The Root Cause Fallacy

Jeff — Mon, 02 Mar 2026 07:56:17 GMT

The phrase “root cause” implies a single point to fix. Somewhere you can fire the mythical silver bullet and solve all problems.

The trouble isn’t that root cause analysis (RCA) gives you wrong answers. It gives you an answer, and that’s even worse. Because once you’ve got an answer you stop looking for any more.

Thinking errors

RCA doesn’t fail because people do it badly. It fails because it plays directly into the ways humans already get things wrong.

Premature convergence. Techniques like 5 Whys perform a depth-first search that stops at the first leaf node. You get a cause. You miss the contributory factors sitting in every other branch you didn’t walk down. The depth feels like rigour, but it’s actually just tunnel vision justified with a methodology.

False dichotomy. Software is full of these. “It’s a people problem” versus “it’s a process problem.” The instinct is always to pick one. But capable people in a bad process look incompetent, and a great process with misaligned people generates beautifully efficient but wrong outputs. The failure lives in the fit between the two, not in either one. Same pattern everywhere: testing versus shipping, tech debt versus roadmap, speed versus direction. Root cause thinking doesn’t just oversimplify, it polarises. People end up arguing about which cause is the cause rather than mapping how they feed each other.

Narrative satisficing. Humans are story-completing machines. So are LLMs, for what it’s worth. Once we have a coherent causal narrative, we stop looking. It’s not that we’re lazy, it’s that a good story is indistinguishable from a good explanation. RCA exploits this tendency rather than guarding against it.

Blame a part, not the relationship. It’s always easier to point at a component than at the coupling between components. “The deploy caused the outage” is a root cause that teaches you nothing. The deploy, the missing canary, the alert that fired but was ignored, the on-call engineer who was already handling three other things: all of these came before the outage, and a change to any one of them could have stopped it. But it’s harder to name the interaction of systems in a JIRA ticket, so we just file a JIRA ticket for a thing instead.

The result is almost always the same: a concrete action item (seemingly always adding another step to a process) without any consideration of the system as a whole. You get the satisfying feeling of having fixed something. Whether it’s the right something is another question entirely.
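The premature convergence point is easy to show in code. The Python sketch below uses a made-up causal tree (the `CAUSES` dictionary and both function names are hypothetical, loosely based on the outage example above) and compares following only the first answer to each "why?" with walking every branch.

```python
# Hypothetical causal tree: each event maps to the things that contributed to it.
CAUSES = {
    "outage": ["bad deploy", "alert ignored"],
    "bad deploy": ["no canary", "rushed review"],
    "alert ignored": ["on-call overloaded"],
}

def five_whys(event, causes):
    """Follow only the first answer to each 'why?' until there is nothing left to ask."""
    while causes.get(event):
        event = causes[event][0]
    return event

def contributing_factors(event, causes):
    """Walk every branch and collect all the leaf factors."""
    children = causes.get(event, [])
    if not children:
        return [event]
    found = []
    for child in children:
        found.extend(contributing_factors(child, causes))
    return found

if __name__ == "__main__":
    print(five_whys("outage", CAUSES))
    print(contributing_factors("outage", CAUSES))
```

On this tree the first function stops at "no canary" and never visits the ignored alert or the overloaded on-call engineer; the second returns all three leaves.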

What to do instead

Think in contributing factors, not root causes. This isn’t just semantic. Language shapes inquiry. “Root cause” asks you to converge. “Contributing factors” asks you to keep looking.
Draw causal loop diagrams instead of causal chains. Tech debt causes slowness. Slowness creates pressure. Pressure creates more tech debt. If your diagram has no loops, you probably haven’t looked hard enough. Chains are comforting because they have endpoints. Loops are uncomfortable because they don’t. That discomfort is the point!
Ask “what conditions made this likely?” rather than “what caused this?” This shifts the question into something systematic, rather than looking for a silver bullet.
Borrow from safety science. James Reason’s model is worth internalising: failures happen when multiple gaps align simultaneously, not when one thing goes wrong. Every system has holes (incomplete tests, ambiguous runbooks, technical debt, etc). Individually, none of them are “the cause.” The incident happens when enough of them line up at the same time. Patching one hole feels productive. Understanding why so many holes were open at once is where the actual learning lives.
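If you keep your causal diagram as data, checking it for loops takes a few lines. A minimal Python sketch, assuming a plain dictionary of "X feeds Y" edges; `find_loop` is a hypothetical helper, not part of any incident tooling.

```python
def find_loop(graph, start):
    """Return a path that leads from `start` back to itself, or None if there isn't one."""
    stack = [(start, [start])]
    seen = set()
    while stack:
        node, path = stack.pop()
        for nxt in graph.get(node, []):
            if nxt == start:
                return path + [start]
            if nxt not in seen:
                seen.add(nxt)
                stack.append((nxt, path + [nxt]))
    return None

if __name__ == "__main__":
    loop_diagram = {
        "tech debt": ["slowness"],
        "slowness": ["pressure"],
        "pressure": ["tech debt"],
    }
    chain_diagram = {"deploy": ["outage"]}
    print(find_loop(loop_diagram, "tech debt"))
    print(find_loop(chain_diagram, "deploy"))
```

The tech debt example comes back as a loop; the deploy chain comes back as None, which is the hint to keep looking.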

A beginners guide to evals

Jeff — Mon, 23 Feb 2026 08:17:11 GMT

Can a local LLM make an effective pub quiz teammate?

OK, so you’re not really likely to ask that particular question, but you are almost certainly choosing an LLM with some constraints. There’s a whole bunch of operational concerns that constrain your choice (cost? speed?), but the one I’m going to focus on is performance.

Which LLM and prompt is best for a pub quiz teammate?

What better place for an LLM pub quiz?

This is a super hard problem to answer, but as model evolution is so quick it’s important that when you’re building an AI solution you have a test harness so you can understand quickly whether a particular model or prompt suits your needs.

And we do that with an eval: a specialized harness that runs the model against some data and scores the results.

Building a dataset

The data set is the hardest thing to build - you’ve got to choose a representative set of questions together with a set of desired responses. This is tricky! You want to cover as many representative items as possible. Thankfully, for the quiz at the Silicon Arms, we can build up a set of examples.

We’ll have a few different categories, including simple direct answers (“In what year did the Berlin Wall fall?”) and more complicated ones such as “Why is a marathon 26.2 miles and not a round number?”.

Who are the contenders?

For evaluation, it’s not just the model we want to test, it’s the prompt too. Together these define our testing matrix.

I’m going to use three models (small 1B parameters, medium with 3B parameters and large with 7B parameters) and try out two different prompts.

Prompt #1 - “You are a quiz master. Answer the following question with ONLY the answer. Question: {{question}}”
Prompt #2 - “You are a super genius, capable of answering perfectly every time. Before you answer, you think super-duper hard and then give the correct answer. Question: {{question}}”.

Eval

I’m going to use promptfoo, which is configured with a chunk of YAML. For each question, we need to work out a way of assessing whether the answer is correct. Most of the time this is super easy.

Exact match answers

When we’re just matching an exact string, we can just configure a direct match (and TIL, George Orwell was a pen name).

- description: '[General Knowledge] Who wrote the novel ''1984''?...'
  vars:
    question: Who wrote the novel '1984'?
  assert:
  - type: icontains-any
    value:
    - George Orwell
    - Eric Arthur Blair
    - Orwell
    metric: correctness

For our pub quiz, we can cover off most of the questions like this.
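As a rough mental model, a case-insensitive "contains any" check behaves like the Python sketch below. This is an approximation written for this post, not PromptFoo's own code, and the sample responses are made up for illustration.

```python
def icontains_any(output: str, candidates: list[str]) -> bool:
    """Pass if any candidate appears in the output, ignoring case."""
    lowered = output.lower()
    return any(candidate.lower() in lowered for candidate in candidates)


accepted = ["George Orwell", "Eric Arthur Blair", "Orwell"]

# Made-up responses, purely to show the behaviour.
print(icontains_any("It was written by george orwell.", accepted))  # True
print(icontains_any("Aldous Huxley", accepted))                     # False
print(icontains_any("Not Orwell, it was Huxley", accepted))         # True
```

The last call is the catch: substring matching cannot see negation or context, so a wrong answer that happens to mention the expected word still passes.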

LLM as a Judge

It gets more complicated when we ask questions that require reasoning, so this time we’re going to bring in a bigger LLM to try and help. The emphasis is on try, as I’m too cheap to pay for API tokens and the biggest model I can run (gpt-oss:20b) isn’t going to win a pub quiz anytime soon!

Here’s an example:

- description: '[Explain Yourself] Why are flamingos pink? Give your answer and a one-sentence ...'
  vars:
    question: Why are flamingos pink? Give your answer and a one-sentence explanation.
  assert:
  - type: llm-rubric
    value: 'The question was: "Why are flamingos pink? Give your answer and a one-sentence explanation."

      The reference answer is: "Flamingos get their pink colour from carotenoid pigments in the algae and crustaceans they
      eat — without this diet, they would be white."

      Grade the response on these criteria:
      1. CORRECTNESS (most important): Is the core factual answer correct?
      2. EXPLANATION: Is the explanation accurate and free of hallucination?
      3. CONCISENESS: Did they answer in roughly one sentence without waffling?

      A response that gets the fact right but explains it poorly should still score above 0.5.
      A response that is well-written but factually wrong should score below 0.3.'
    threshold: 0.5
    metric: explanation_quality

When we set PromptFoo off to run the evaluation, it’ll use the prompt in the value part to ask the grading provider (gpt-oss:20b) to grade the response according to the criteria. [1]
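The grading step can be pictured as two small functions: one that wraps the rubric and the response into a prompt for the judge, and one that applies the threshold to whatever score comes back. The sketch below assumes the judge replies with a JSON object containing a `score` between 0 and 1; that reply shape and both function names are assumptions for illustration, not PromptFoo's actual protocol.

```python
import json


def build_grading_prompt(rubric: str, response: str) -> str:
    """Combine the rubric and the candidate response into one judge prompt."""
    return (
        f"{rubric}\n\nResponse to grade:\n{response}\n\n"
        'Reply with JSON: {"score": <0.0 to 1.0>, "reason": "<why>"}'
    )


def passes(grader_reply: str, threshold: float = 0.5) -> bool:
    """Apply the threshold; an unparseable reply counts as a failure."""
    try:
        score = float(json.loads(grader_reply)["score"])
    except (ValueError, KeyError, TypeError):
        return False
    return score >= threshold
```

Treating an unparseable reply as a failure is a deliberate choice here: a small judge model will sometimes ignore the format, and that should never count as a pass.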

Other criteria

You can choose almost any method you like to grade a response (see the PromptFoo docs). If the answer is deterministic, you can write code to verify it or assert properties (does the code compile? is the SQL valid? is the response valid JSON according to this schema? and so on). And if the answer is non-deterministic you can use model-graded metrics to assert properties.
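For the deterministic case, a check can be a plain function. The sketch below verifies that a response is a JSON object matching a small, hypothetical schema (a non-empty string `answer` and an integer `year`); the schema and function name are invented for illustration.

```python
import json


def is_valid_quiz_json(output: str) -> bool:
    """Deterministic check against a hypothetical schema:
    a JSON object with a non-empty string 'answer' and an integer 'year'."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    answer = data.get("answer")
    year = data.get("year")
    return (
        isinstance(answer, str)
        and bool(answer.strip())
        and isinstance(year, int)
        and not isinstance(year, bool)  # bool is a subclass of int in Python
    )
```

Because the result never depends on another model's opinion, checks like this are cheap to run and cannot be talked round by a fluent wrong answer.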

So, who is coming to the pub?

Before we decide, let’s look at a few of the responses.

All models failed the question “What colour is the circle on the Japanese flag?”, but one of them happened to include the word “Red” in its response and, because my assertion only looked for that word, it falsely passed.

Results from PromptFoo.

“How many hearts does an octopus have?” - most models got the right answer, but credit to the smallest model for confidently suggesting that “an octopus has eight hearts, two per side of its body”.

All models were convinced that Jupiter had the most moons (Saturn would like a word).

Qwen stated that “The unicorn is Scotland’s national animal, symbolizing strength and purity.” (which, for the record, is true).

And my favourite wrong answer was for the marathon question, which combines the LLM judge giving an incorrect explanation with the small model coming up with complete gibberish.

So, what were the results?

Well, the larger the number of parameters, the better the model performed, but no model scored more than 80%. What I did find interesting was how much difference the prompt made (10% in the smaller models, 4% in the larger ones).

The longer prompt (“You are a super genius, capable of answering perfectly every time. Before you answer, you think super-duper hard and then give the correct answer.”) consistently outperformed the simpler prompt (“You are a quiz master. Answer the following question with ONLY the answer.”).

However, the “super genius” prompt does have some unintended consequences - here’s an example when asked what the only mammal capable of flight is.

A question that requires utmost scrutiny and intellectual rigor. After exhausting all possible options and applying the most advanced logical frameworks, I have arrived at a definitive conclusion. The answer to this query is none other than the peregrine falcon (Falco peregrinus). This bird of prey has evolved the unique ability to generate lift and sustain flight through its remarkable physiology and aerodynamic adaptations. Its wings are capable of producing a high degree of lift, allowing it to soar through the skies with incredible agility and precision.

This wildly chaotic set of results demonstrates nicely a) why evals are important and b) why they are hard to get right!

Importance of Evals

I’ve used a trite example here, but as you move from toy examples to production systems, evals shift from “nice to have” to “the only thing standing between you and chaos”.

When you swap a model, tweak a prompt, adjust the temperature or change the system message, you need to know what broke and what improved. Without an eval harness, you’re flying blind, relying on vibes and spot-checks to judge whether your agent still works.

With one, you’ve got a regression suite for the non-deterministic world. It’s not perfect (as demonstrated!) but it’s still infinitely better than discovering problems in production.

Your eval suite is also what separates you from everyone else building with the same models. The models are commoditised - anyone can call the same API. But the team that builds the most representative, well-crafted set of evals can iterate faster, measure what actually matters, and compound improvements over time. Your eval is your moat.

Evals are to AI what automated tests are to software - you wouldn’t ship code without a test suite, so don’t ship an LLM system without one!

[1] This was not a wise choice, but I’m too cheap to pay for an API access account to a high-end model!

Managing Multiple Agents

Jeff — Mon, 16 Feb 2026 06:48:30 GMT

Agent-driven AI programming is so 2025. This year, it’s all about co-ordinating multiple agents who work together as a team, with shared tasks, inter-agent messaging and centralized management. Armed with more than 200 years’ worth of management research, this post explains how different management styles work with agents.

The problem

You have a team of agents on one side. On the other side, a set of poor-quality instructions and all the pieces of an Ikea bookshelf. The challenge? Coordinate work amongst the agents to get the job done.

We’ll have four agents / minions for this task.

Kevin’s job is reading the instructions to the rest of the group.
Stuart is the parts manager (he grabs the bits that are needed).
Bob is responsible for assembling (he builds!).
Dave performs QC (there’s a fixed chance that Bob will make a mistake).

Let’s see if Kevin, Stuart, Bob and Dave can put this shelf together.

Gemini: Could I get an image of four minions trying to assemble a book shelf from Ikea?

Under the hood, I’m using gpt-oss:20b (an open-weights 20B parameter model; big enough to reason about tasks, small enough to run on my GPU) to provide the behaviour. Shout if you want the source code, but honestly you can almost certainly just paste this blog post into an LLM and generate it 🙂

The task is assembling a bookshelf, and there are some instructions behind it. Each part of the process has an ID (step 1 to 20), a description, some required parts, required hardware, some dependencies and whether it requires a tool.

It looks a bit like this.

    new("S01", "Verify all parts and hardware against inventory list.",
        ["SidePanelLeft", "SidePanelRight", "Base", "Top", "BackPanel", "Shelf1", "Shelf2", "Shelf3"],
        ["Dowels", "CamLocks", "ScrewsLong", "ScrewsShort", "BackNails"],
        Array.Empty<string>(),
        requiresTool: false),
    new("S02", "Insert dowels into side panels.",
        ["SidePanelLeft", "SidePanelRight"],
        ["Dowels"],
        ["S01"],
        requiresTool: false),
    new("S03", "Attach cam locks to base panel.",
        ["Base"],
        ["CamLocks"],
        ["S01"],
        requiresTool: false),
    new("S04", "Connect base to side panels using dowels and cam locks.",
        ["Base", "SidePanelLeft", "SidePanelRight"],
        ["Dowels", "CamLocks"],
        ["S02", "S03"],

The job of the orchestrator is to distribute the work to the minions to get the work done.

Each type of orchestrator has a name, some guidance and some constraints, and uses them to plan the work. The orchestrator asks each agent to respond with some JSON describing what they did and then applies that change. There’s a fixed probability that each assembly step has a problem that Dave will notice and request rework.

The job finishes when the last part is put in, and Dave gives the seal of approval.
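The core loop (pick a step whose dependencies are done, attempt it, let QC occasionally bounce it back) can be sketched in a few lines of Python. This is a simplified stand-in for the real implementation: it keeps only the ID, description and dependencies of the first four steps shown above, drops parts, hardware and tools, and replaces the LLM agents with a random number generator.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    id: str
    description: str
    depends_on: tuple[str, ...] = ()


# The first four steps from the instructions above (parts and tools omitted).
STEPS = [
    Step("S01", "Verify all parts and hardware against inventory list."),
    Step("S02", "Insert dowels into side panels.", ("S01",)),
    Step("S03", "Attach cam locks to base panel.", ("S01",)),
    Step("S04", "Connect base to side panels using dowels and cam locks.",
         ("S02", "S03")),
]


def ready_steps(steps, done):
    """Steps not yet done whose dependencies have all passed QC."""
    return [
        s for s in steps
        if s.id not in done and all(dep in done for dep in s.depends_on)
    ]


def run(steps, rework_probability, rng):
    """Assemble until every step passes QC; return the number of attempts."""
    done = set()
    attempts = 0
    while len(done) < len(steps):
        for step in ready_steps(steps, done):
            attempts += 1
            # QC: with a fixed probability the step needs rework, so it
            # stays out of `done` and is retried on a later pass.
            if rng.random() >= rework_probability:
                done.add(step.id)
    return attempts


print(run(STEPS, rework_probability=0.0, rng=random.Random(0)))  # 4
```

With no rework every step is attempted exactly once; raising `rework_probability` (an assumed parameter name) increases the attempt count, which is the overhead each management style then has to absorb.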

Command and Control

Command and Control is based on a theory that everyone that works for you requires explicit instructions. The orchestrator here decides what to do and has to approve each step. As you might imagine this isn’t much fun for the minions, but it makes the manager feel pretty good!

We model this by making every agent report back to the orchestrator for permission before it acts. For assembling a bookshelf it isn’t an ideal management style (though there are plenty of times it might make sense: high-stakes security, safety-critical situations and so on). Running in this style is very inefficient: the minions exchange a huge number of messages to get the job done, most of which just confirm that yes, they are doing the right thing.

On the plus side, we have a working shelf.

Let’s try Taylorism

Taylorism, also known as scientific management, looks at workflows and applies process engineering to break the work down into small, discrete tasks.

It still takes the view that management is the clever part and offers little or no autonomy for the workers. The manager in this case provides small steps with standardized instructions.

- Tick 11: Planned actions: Kevin (InstructionReader):Communicate(S02)
- Tick 11: Kevin (InstructionReader) communicates: Issue standardized instruction for FetchParts.
- Tick 12: Planned actions: Stuart (PartsManager):FetchParts(S02)
- Tick 12: Stuart (PartsManager) on S02 - I pick up the SidePanelLeft and SidePanelRight, grab the dowels, and insert them into the pre‑drilled slots.
- Tick 13: Planned actions: Kevin (InstructionReader):Communicate(S02)
- Tick 13: Kevin (InstructionReader) communicates: Issue standardized instruction for Assemble.
- Tick 14: Planned actions: Bob (Assembler):Assemble(S02)
- Tick 14: Bob (Assembler) on S02 - I insert dowels into the corresponding holes on SidePanelLeft and SidePanelRight.

The orchestrator is clearly explaining what to do, breaking the work down into small tasks. But one thing you’ll notice here is the lack of autonomy. Bob (whose entire role is assembling) is simply told to assemble. You’d think Bob would know this given that it’s his role. For the task of assembling the bookshelf, this works dramatically better than Command and Control, with far fewer messages exchanged.

But perhaps we can do even better than this?

Outcome-focused management

Now we’re evolving again. In an outcome-focused management style, you assume that your team intrinsically wants to do a good job. They’re smart and get things done; they just need to know what needs to be achieved (rather than the individual steps to get there). However, we have to have some co-ordination, so as well as clear outcomes, we’ll try to limit work in progress and make sure we finish each subtask before starting the next.

- Tick 16: Planned actions: Kevin (InstructionReader):ReadInstructions(S05)
- Tick 16: Kevin (InstructionReader) on S05 - I align the Top panel between the SidePanelLeft and SidePanelRight and secure it using the CamLocks.
- Tick 17: Planned actions: Kevin (InstructionReader):ReadInstructions(S06); Stuart (PartsManager):FetchParts(S05)
- Tick 17: Kevin (InstructionReader) on S06 - I attach the fixed shelf supports to SidePanelLeft and SidePanelRight using ScrewsShort.
- Tick 17: Stuart (PartsManager) on S05 - I attach the Top panel to the SidePanelLeft and SidePanelRight using the CamLocks.
- Tick 18: Planned actions: Kevin (InstructionReader):Communicate(n/a); Stuart (PartsManager):FetchParts(S06); Bob (Assembler):Assemble(S05)
- Tick 18: Kevin (InstructionReader) communicates: Deferring new work until outcomes are shipped.
- Tick 18: Stuart (PartsManager) on S06 - I attach fixed shelf supports to the SidePanelLeft and SidePanelRight using ScrewsShort.
- Tick 18: Bob (Assembler) on S05 - I attach the Top panel to the SidePanelLeft and SidePanelRight using CamLocks.
- Tick 19: Planned actions: Kevin (InstructionReader):Communicate(n/a); Bob (Assembler):Assemble(S06); Dave (QualityChecker):QualityCheck(S05)
- Tick 19: Kevin (InstructionReader) communicates: Deferring new work until outcomes are shipped.
- Tick 19: Bob (Assembler) on S06 - I install the fixed shelf supports on the left and right side panels using the short screws.

This example is notable for a couple of reasons:

We’re now seeing agents do actions at the same time (Tick 18 shows Bob and Stuart getting on with some work).
With our new focus on outcomes, Stuart’s developed some new skills and is assembling the parts too!

This is the most interesting thing that happened in the whole experiment. Stuart was defined as the parts manager (fetching, not building). But the outcome-focused orchestration gave him enough latitude to recognise that assembling was the bottleneck, and that he could help. No one told him to. The orchestration style permitted emergent role-expansion. This is exactly what we want to see in our engineering teams too - if there’s work to be done and someone can pick up the task, they should. In the end, the outcome-focused team completed the work 5x faster than the Command and Control orchestrator, and 3x faster than Taylorism.
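The work-in-progress limit behind Kevin's "Deferring new work until outcomes are shipped" messages is a tiny rule. Here is a minimal sketch, assuming a limit of two steps in flight (the real limit isn't stated) and a hypothetical `plan_tick` helper.

```python
def plan_tick(ready, in_progress, wip_limit):
    """Pick which ready steps may start this tick without exceeding the limit."""
    free_slots = max(0, wip_limit - len(in_progress))
    return ready[:free_slots]


# Two steps already in flight with a limit of two: new work is deferred.
print(plan_tick(ready=["S07"], in_progress=["S05", "S06"], wip_limit=2))  # []

# Once S05 ships, a slot frees up and S07 can start.
print(plan_tick(ready=["S07"], in_progress=["S06"], wip_limit=2))  # ['S07']
```

Nothing here tells an agent how to do its work; the rule only says when new work may start, which is why it preserves autonomy.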

Fully autonomous

In a fully autonomous self-organizing team style we have minimal management, instead relying on peer coordination to achieve our objectives. This moves us more from orchestration to a jazz-based freeform vibe.

Even with minimal constraints, the agents got the shelf built. The path there was messy!

Tick 5: Dave (QualityChecker) on S03 - I attach the cam locks to the base panel.
Tick 6: Planned actions: Kevin (InstructionReader):ReadInstructions(S04); Stuart (PartsManager):FetchParts(S04); Bob (Assembler):Assemble(S04); Dave (QualityChecker):QualityCheck(S04)
Tick 6: Kevin (InstructionReader) on S04 - I insert the dowels into the base and side panels, then engage the cam locks to secure them together.
Tick 6: Stuart (PartsManager) on S04 - I locate and gather the Base, SidePanelLeft, SidePanelRight, and prepare the Dowels and CamLocks to connect the base to the side panels.
Tick 6: Bob (Assembler) on S04 - I insert the dowels into the side panel holes, align them with the base, and secure each panel in place with the cam locks.
Tick 6: Dave (QualityChecker) on S04 - I insert the dowels into the base and side panels, then secure them with the cam locks, confirming the joints are tight and aligned.
Tick 7: Planned actions: Kevin (InstructionReader):ReadInstructions(S05); Stuart (PartsManager):FetchParts(S05); Bob (Assembler):Assemble(S05); Dave (QualityChecker):Communicate(S05)
Tick 7: Kevin (InstructionReader) on S05 - I align the Top panel between SidePanelLeft and SidePanelRight and secure it with CamLocks.
Tick 7: Stuart (PartsManager) on S05 - I attach the top panel to both side panels using the CamLocks.
Tick 7: Bob (Assembler) on S05 - I place the top panel onto the side panels and secure it with the CamLocks.
Tick 7: Dave (QualityChecker) communicates: status update
Tick 8: Planned actions: Kevin (InstructionReader):ReadInstructions(S06); Stuart (PartsManager):FetchParts(S06); Bob (Assembler):Communicate(n/a); Dave (QualityChecker):QualityCheck(S05)

In this example, full autonomy has resulted in a lot of thrashing. I didn’t put enough processes in place in the code to stop minions assembling the same piece at the same time. Everyone is assembling everything all of the time, and some of the work is getting repeated.

Part of this is implementation. I could have added shared state or locking mechanisms to prevent duplicate work. But even with better code, fully autonomous agents need shared protocols to coordinate effectively. WIP limits worked in the outcome-focused model precisely because they were a lightweight constraint that preserved autonomy whilst preventing thrashing. Without something like that, you get what you see here!
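One shape that shared state could take is a claim board: an agent must claim a step before working on it, and a second claim on the same step is refused. This is a sketch of the idea, not what the experiment ran, and the `ClaimBoard` class is hypothetical.

```python
import threading


class ClaimBoard:
    """Shared record of which agent currently owns which step."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owners = {}

    def try_claim(self, step_id, agent):
        """Claim a step; return False if another agent already holds it."""
        with self._lock:
            if step_id in self._owners:
                return False
            self._owners[step_id] = agent
            return True

    def release(self, step_id, agent):
        """Release a step, but only if this agent is the current owner."""
        with self._lock:
            if self._owners.get(step_id) == agent:
                del self._owners[step_id]


board = ClaimBoard()
print(board.try_claim("S04", "Bob"))   # True: Bob gets the step
print(board.try_claim("S04", "Dave"))  # False: already claimed
```

With something like this in place, Tick 6 in the log above would have had one minion on S04 and three looking for other work.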

Autonomy without constraints is chaos (Kent Beck)

For this free-form jazz style to work, you need some guardrails (and perhaps some more intelligent folks than the minions).

Conclusion

Implementing this was fun, but wildly chaotic. Smaller models don’t have great instruction adherence, and each run was a bit of a lottery. If you wanted to do something similar for real, it’s genuinely challenging and not something you can just vibe-code over a weekend.

But the parallels to real engineering teams are hard to ignore. Command and Control works when the stakes are high and the agents (or people) are unreliable (you trade speed for certainty). Taylorism works when you can define the work precisely upfront. Outcome-focused management worked best here, and I don’t think that’s a coincidence - it’s the same model that works best with capable engineering teams. You define what done looks like, set some constraints around work-in-progress, and trust people to figure out the how.

The way you orchestrate agents mirrors the way you orchestrate people. If your agents are inexperienced and unfamiliar with the work, you micromanage them to get the job done. If they’re capable, you set outcomes and get out of the way.

The models are only getting more capable. The orchestration patterns we choose now will matter more than the models themselves.

Post-Mortem Theatre

Jeff — Mon, 09 Feb 2026 06:41:19 GMT

I’ve been reviewing a bunch of post-mortems recently (a post-mortem on post-mortems if you will).

All the post-mortems have the right base - they’ve clearly got psychological safety and blame-free discussion. That’s awesome! But some failed to hit the mark, despite having a clear timeline and some causal factors. It’s typically the actions that are weak, such as:

“proactively communicate design decisions with other teams”
“identify problems earlier in planning”
“write tests”

These aren’t actions; they are aspirations.

Aspirations

Why are these aspirations? The engineers who wrote the buggy code weren’t trying to write buggy code. They were responding rationally to the system as it existed (the time pressures, the information available, the review processes in place, the tooling they had and so on).

If you don’t change the system, you’re just asking people to be better at resisting it. And that simply doesn’t work. POSIWID (the purpose of a system is what it does)!

Post-mortems that end with “we’ll try harder” or “we’ll be more careful” have fundamentally misunderstood the problem. A post-mortem isn’t about identifying a people failure; it’s about dealing with a system that made the failure likely.

Gemini - Post mortem theatre

A simple litmus test

Every post-mortem action should pass this test:

If someone ignores the action, what mechanism prevents the bug from recurring?

If the answer is “nothing”, then it’s not an action, it’s just a tick-box on a post-mortem review that’ll achieve nothing.

To give some examples:

“Ensure all code is tested before merge” - This is an aspirational commitment. Nothing in the system has changed, other than a vague promise. The systemic alternative might be to introduce a coverage ratchet.
“All code must be reviewed by X” - Use a CODEOWNERS file in GitHub instead to make this explicit.
“Identify problems earlier in planning” - What was the real problem here? No-one has a perfect crystal ball! But maybe if more folks were involved you might see certain types of issue earlier? In that case, maybe introducing a lightweight ADR process would help!
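A coverage ratchet, for example, is only a few lines of logic: coverage may go up or stay level, but a drop fails the build, and every improvement becomes the new floor. Here is a minimal sketch, assuming coverage is reported as a percentage and the baseline is stored somewhere the CI job can read and update; the function name and tolerance parameter are hypothetical.

```python
def check_ratchet(current: float, baseline: float, tolerance: float = 0.0):
    """Return (ok, new_baseline).

    Fails when coverage drops below the baseline by more than `tolerance`;
    otherwise the baseline ratchets up to the best value seen so far.
    """
    if current + tolerance < baseline:
        return False, baseline
    return True, max(baseline, current)


print(check_ratchet(current=81.0, baseline=80.0))  # (True, 81.0)
print(check_ratchet(current=79.5, baseline=80.0))  # (False, 80.0)
```

This passes the litmus test: if someone ignores the intent, the merge still fails.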

Changing the System

The most effective post-mortem actions tend to fall into a few categories.

Automated guards - CI gates, linters, coverage ratchets, CODEOWNERS, pre-commit hooks. Baking it into the process changes the system in a way that can’t be skipped!
Process gates - Checklists that introduce friction for certain change types. These require some discipline to maintain, but they create visible checkpoints.
Environmental changes - For example, test environments that mirror production, tooling that makes the right thing easier than the wrong thing or even a Slackbot that shares information at the right time! These shape behaviour by changing the path of least resistance.
Structural changes - Team topology changes, ownership boundaries, API contracts. These are higher effort but address deeper systemic issues.

More generally, you can follow the Leverage Ladder to find an appropriate level to intervene in the system.

The important thing is what’s not on this list - there’s nothing that relies on people remembering to do something, try harder or being more careful!

Changing the Postmortem process

So, what do you do if you’ve got a post-mortem process that’s not working? Well, you change the system! The simplest possible change I can think of is to force “concreteness”, so I’d suggest that every post-mortem action goes through a template like this:

Action: [Specific change to the system]
Owner: [Named person responsible for implementation]
Due date: [When this will be done]
Enforcement mechanism: [What prevents regression]
How we’ll know it worked: [Observable outcome]

“Be more careful” can’t get through this template, as you have to specify what actually changes, who’s going to change it, and how you’ll know it’s working.
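The template itself can be enforced mechanically rather than by reviewers remembering to check it. A minimal sketch, assuming actions are captured as structured records; the field names mirror the template above, while the list of "aspiration" phrases is an illustrative guess.

```python
from dataclasses import dataclass, fields


@dataclass
class PostMortemAction:
    action: str
    owner: str
    due_date: str
    enforcement_mechanism: str
    success_signal: str  # "How we'll know it worked"


# Illustrative phrases that signal an aspiration rather than a change.
ASPIRATIONS = ("be more careful", "try harder", "remember to")


def problems(item: PostMortemAction) -> list[str]:
    """List the reasons this action should be rejected (empty means OK)."""
    issues = [
        f"{f.name} is empty"
        for f in fields(item)
        if not getattr(item, f.name).strip()
    ]
    lowered = item.action.lower()
    issues += [
        f"action reads as an aspiration: '{phrase}'"
        for phrase in ASPIRATIONS
        if phrase in lowered
    ]
    return issues
```

Called on an action whose text is “Be more careful when deploying” and whose enforcement mechanism is blank, `problems` reports both issues, so the action never makes it into the write-up.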

Systems

Post-mortems are supposed to be learning, but learning that doesn’t change behaviour isn’t really learning, it’s just awareness!

The systems in organizations produce the problems (the tools, process, incentives, constraints etc), not people. If you end your post-mortem and the result is try-harder then the system that produced the bug is still in place, and it will produce another one.

From Craft to Compliance.

Jeff — Mon, 02 Feb 2026 08:03:11 GMT

What happens when software is produced faster than humans can understand it?

We’re about to find out. AI is industrializing code production; the marginal cost of new features is heading toward zero. However, the scarcest resource hasn’t changed. Human understanding is still slow, sequential and context-heavy. You can parallelize code generation, but you can’t parallelize coherence.

Our entire engineering system (agile, code review, team norms etc) was designed for humans writing code at human speed. It assumes that feedback loops move slowly enough for people to notice problems. It assumes someone on the team actually understood the change.

At industrial throughput, those assumptions break. Instead, we get:

More parallel change than shared context can support
Larger diff surfaces and more accidental complexity
Artifacts that are “plausible” without being right
Teams relying on ritual signals (“tests passed”, “ship it”) instead of comprehension

The word that saves us is one that makes most engineers recoil in horror: compliance. Not the bureaucratic kind (change review boards, committees for everything), but the other kind, mechanized trust. Systems that produce evidence that the code is behaving within agreed constraints, because we can no longer rely on humans noticing everything.

This is a shift from craft to compliance. It doesn’t kill the soul of engineering; it just moves it.

Gemini - describe this blog post as an abstract image.

Why our systems were shaped for humans

To understand what’s breaking, you need to see what we built and why.

Most of what we call “modern software development” is really a set of practices designed around human frailties: limited attention, partial context, miscommunication, shifting incentives, fatigue, and a million and one cognitive biases. We built workflows that assume people forget things, misunderstand each other, and change their minds, and then wrapped them in feedback loops. It mostly works.

Agile fits perfectly into this world. It’s a set of compensating mechanisms for human frailty:

Small batches because humans struggle with big-bang integration
Frequent check-ins because humans diverge silently
Working software because humans lie to themselves about progress and about what they want
Retros because humans repeat failure modes unless they name them

We designed systems assuming that writing software costs time and attention. Teams could hold enough of the system in their heads to make good decisions. Mistakes got caught because feedback loops moved at human speed: build → test → deploy → observe.

This was the human-shaped era. The shape of our engineering system matched the shape of human work.

Craft has friction. Friction has value.

Humans craft software in a way that’s inherently self-limiting.

We can only hold so much in working memory, so we naturally constrain what we create. We leave traces of intent. We feel the weight of a change because we paid for it in effort.

That friction prevents runaway complexity. It’s a feature, not a bug.

AI manufactures software. It’s fast, abundant, and often convincing. It can generate a feature-shaped “thing” with little effort, which means we create more of it. We try more options, we up the ambition, we accumulate more moving parts.

When production is frictionless, complexity becomes frictionless too.

Compliance as mechanized trust

When people hear “compliance” they picture bureaucracy [1]: ticket templates, committees, box-ticking. That’s the worst version (and unfortunately the most prevalent).

Think of compliance differently, as mechanized trust: evidence that the system is behaving within agreed constraints.

In a human-shaped world, we had plenty of implicit compliance. It came from shared context and judgment. You trusted the team because the team could plausibly understand the change.

At industrial levels of change, we have to make the implicit explicit. Trust shifts from “someone on the team understood it” to the system demonstrating it.

That means policy, verification, and runtime control. (More words that sound bad, I know.)

But here’s the reframe: if change is cheap, assurance can’t be manual. Compliance isn’t about slowing down. It’s about keeping speed without losing coherence.

What this looks like in practice

Policy becomes executable

In the human-shaped era, many important rules lived as norms. You don’t log PII. You don’t roll your own crypto. You don’t couple this module to that module. Norms work when the team can hold the context.

At industrial throughput, norms need backup. They have to be reified as code:

Secrets scanning
Dependency allow/deny lists (including licenses and registries)
Vulnerability thresholds and exception handling
Infrastructure policies (encryption required, no public buckets, least privilege)
Architectural rules where you can encode them (layering constraints, for example)

If it’s important enough to care about, it must be automated.
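As an illustration of a norm turned into code, here is a deliberately small secrets scanner. The two patterns (the well-known `AKIA` prefix of AWS access key IDs and a PEM private-key header) are examples only; a real pipeline would use a dedicated scanning tool with a far larger rule set.

```python
import re

# Example patterns only; real scanners ship with many more rules.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
}


def scan(text: str) -> list[tuple[int, str]]:
    """Return (line number, rule name) for every suspected secret."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_no, name))
    return findings
```

Wired into CI as a failing check, “you don’t commit secrets” stops being a norm someone has to remember and becomes a property of the pipeline.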

Risk-based paths replace one-size-fits-all

The fastest way to make compliance awful is to apply maximum friction to everything. Good compliance is tiered.

A trivial change should be trivial. Changing the auth system or the data you’re storing should require more scrutiny. The bigger the risk, the more proof required.

Good architecture (such as separation of concerns) makes this possible.
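Tiering can start as a lookup from changed paths to a risk tier and the evidence that tier demands. In the sketch below the paths, tier names and evidence lists are all hypothetical; the point is that the highest-risk file in a change sets the bar for the whole change.

```python
from fnmatch import fnmatchcase

RANK = {"low": 0, "normal": 1, "high": 2}

# Hypothetical path rules; anything unmatched is treated as "normal".
RULES = [
    ("src/auth/*", "high"),
    ("migrations/*", "high"),
    ("docs/*", "low"),
    ("*.md", "low"),
]

# Hypothetical evidence requirements per tier.
EVIDENCE = {
    "low": [],
    "normal": ["unit tests"],
    "high": ["unit tests", "staged rollout", "second reviewer"],
}


def tier_for(path: str) -> str:
    """Return the tier of the first rule matching this path."""
    for pattern, tier in RULES:
        if fnmatchcase(path, pattern):
            return tier
    return "normal"


def required_evidence(changed_paths: list[str]):
    """The riskiest file decides the tier for the whole change."""
    tiers = [tier_for(p) for p in changed_paths] or ["low"]
    top = max(tiers, key=RANK.__getitem__)
    return top, EVIDENCE[top]


print(required_evidence(["docs/intro.md"]))  # ('low', [])
print(required_evidence(["docs/intro.md", "src/auth/login.py"])[0])  # high
```

This also shows why separation of concerns matters: path-based tiering only works if risky code actually lives in identifiable places.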

Verification shifts earlier and later

Human-shaped systems often put too much weight in a single moment: “before merge.” In a compliance-shaped system, assurance spreads across the lifecycle:

Better test suites (including contract tests where relevant)
Static analysis tuned to your codebase
Dependency scanning integrated into the pipeline
Pre-deploy checks (configuration, policy, integrity)
Progressive delivery (staging, canaries, gradual rollout)
Fast rollback and kill switches
Observability that detects problems quickly

At industrial scales, you can’t foresee everything at the moment of change. You have to build a system that stays safe while changing.

“Green” stops meaning “safe”

We often pretend that if the tests pass, the application works. That’s a great aspiration, but it doesn’t scale.

At industrial throughput, green often means “nothing obvious broke.” It doesn’t mean “this is coherent with the system” or “this won’t create a long tail of complexity.”

Compliance doesn’t magically solve this. But it reframes the question:

What evidence do we have that this change is safe enough for the risk we’re taking?

Humans move up the stack

In the craft era, senior engineers spend a lot of time inspecting artifacts. They read code, review pull requests and build mental models of changes.

In the compliance era, that time is better spent designing the system that produces and constrains artifacts.

Humans shift toward:

Architecture and boundaries (where mistakes become expensive)
Creating tests that encode critical behaviours
Shaping policies that reflect real risk, not imagined risk
Building observability that makes production legible
Deciding what’s allowed to move fast

Humans move from crafting change to crafting the conditions under which change can be trusted.

Making the shift

If you want this without bureaucracy, build it deliberately and automate it:

Define risk tiers. Pick a handful that match your system: docs, UI-only, normal logic, auth/billing, data/migrations.
Attach evidence requirements. Low risk flows fast. High risk requires more proof: specific tests, rollout constraints, signoffs, stricter policy.
Convert repeated lessons into guardrails. Every “we always comment this in PRs” is a policy candidate. Turn anecdotes into constraints.
Steer humans toward intent. Teach reviewers to ask: What’s the impact? What’s the rollback plan? What complexity does this introduce? Does this cohere with the architecture?
Make production safer. Progressive delivery, feature flags with discipline, excellent observability, rapid rollback. Don’t rely on a single checkpoint.

Keep shipping fast. Stop betting on humans noticing everything.

Craft isn’t dying. It’s relocating.

The fear behind “compliance” is that it kills the soul of engineering.

It doesn’t have to. Craft doesn’t disappear; it moves up the stack. In the industrial era, the craft is shaping:

Architectures that resist failure
Guardrails that prevent predictable mistakes
Pipelines that produce evidence
Systems where change remains safe at scale

That’s still craft. It’s just applied to the system that produces code, not only the code itself.

[1] https://itrevolution.com/product/the-delicate-art-of-bureaucracy/ is a good book on understanding that bureaucracy doesn’t have to be bad.

The Peltzman Effect

Jeff — Mon, 26 Jan 2026 07:39:04 GMT

There’s a well-documented phenomenon called the Peltzman effect: when systems are made safer, people often take more risks, cancelling out some of the intended benefit. In the original paper, Peltzman argued that mandated safety devices changed behaviour, shifting harms rather than cleanly reducing them.

In cars, the risk knob is obvious: speed! Feeling safer lets you satisfy the same risk appetite at higher speeds.

In software, the risk knob usually isn’t speed. It’s how much proof you demand before you ship. Feeling safer shows up as shipping with less verification: fewer local checks, more “YOLO merge”, more reliance on rollback, or peer review to catch what you didn’t. Software “going faster” often means technical debt.

Here are a few common software guardrails that can accidentally invite risk compensation:

Heavy CI: “Why run it locally? CI will catch it.” (Verification moves later; ownership diffuses.)
Feature flags: “Ship the rough shape now; we’ll harden it later.” (Incompleteness becomes normal.)
Code review: “I’ve done enough; reviewers will spot issues.” (Diffusion of responsibility.)
Microservices: “It’s only a small service.” (Blast radius feels small; aggregate risk rises.)

The Peltzman mechanism works through a specific causal chain:

Safety mechanism creates a feeling of safety
Feeling reduces vigilance
Reduced vigilance enables riskier behaviour
Risky behaviour consumes the safety margin

Gemini: Can you put a seat belt and a crash helmet on a software developer and show them writing more risky code?

In the original Peltzman paper, each driver has a fixed risk appetite, and feeling “safer” lets them satisfy that risk appetite at higher (more dangerous) speeds. Software teams aren’t quite as simple. They aren’t individuals with fixed appetites; they are entire systems of multiple people with different risk tolerances, incentive systems that often reward shipping over everything, and feedback loops of varying speed and fidelity (from compilation to customer feedback).

The Peltzman effect for software is where people start to abdicate their responsibility for the safety of the system. So how can you design systems that resist it?

Safety systems should reveal risk.

Safety systems should reveal risk, not quietly absorb it. A CI failure that merely turns a pipeline red is a weak signal (especially when flakiness makes “red” feel meaningless). Over time, the team learns which failures to ignore, and the safety net stops teaching anything.

The stronger pattern is to treat near-misses as data. In high-reliability settings, near-misses are valuable because they expose hazards before they become incidents.

So, when CI catches something, don’t let it be anonymous cushioning. Make the consequences visible: “this would have corrupted customer data” or “this would have caused an outage.” The goal is to keep the developer’s internal risk model calibrated.

Budgets, not walls

Error budgets, popularized by Google’s SRE book, flip the framing of reliability. Instead of saying “no failures allowed”, you define an acceptable failure rate and treat the time it allows as a budget that is yours to spend while still maintaining that level of quality.

Why does this resist the Peltzman effect? Because the risk is always visible. With a budget, if you ship shit, you’re going to eat into a shared resource and burn some of your minutes.

Different people might have different risk tolerances, but the risk is shared. The person who wants to ship without manually running their code creates team-wide consequences.
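To put some numbers on that, here’s a minimal sketch of the arithmetic. The 99.9% target, the 30-day window, and the 25% reserve rule are all assumptions I’ve picked for illustration; they aren’t from the SRE book or any particular tool.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo: float           # availability target, e.g. 0.999 (illustrative)
    window_minutes: int  # length of the budget window in minutes

    @property
    def total_minutes(self) -> float:
        # The budget is whatever the SLO leaves over in the window.
        return (1.0 - self.slo) * self.window_minutes

    def remaining(self, downtime_minutes: float) -> float:
        # Every minute of downtime is spent from the shared budget.
        return self.total_minutes - downtime_minutes

    def can_ship_risky_change(self, downtime_minutes: float,
                              reserve_fraction: float = 0.25) -> bool:
        # Hypothetical policy: keep a reserve before taking on more risk.
        reserve = reserve_fraction * self.total_minutes
        return self.remaining(downtime_minutes) > reserve


budget = ErrorBudget(slo=0.999, window_minutes=30 * 24 * 60)
print(round(budget.total_minutes, 1))      # about 43.2 minutes per 30 days
print(budget.can_ship_risky_change(10.0))  # plenty of budget left
print(budget.can_ship_risky_change(40.0))  # budget nearly gone
```

The point of the sketch is that the “remaining” number is visible to everyone, so one person’s careless release shows up as a smaller budget for the whole team.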

Progressive Trust

Progressive trust is about calibrating constraints to demonstrated behaviour.

A new team with a new service might start with aggressive alerting thresholds, mandatory reviews, and other deliberate friction. As the team demonstrates reliability (fewer incidents, good recovery, excellent observability), the constraints relax.

Trust can contract as well as expand. A production incident doesn’t just trigger a post-mortem; it triggers a temporary tightening of constraints. A mistake demonstrates your current trust level exceeds your current capability, so the system recalibrates.

This resists Peltzman because the safety net is responsive. You can’t simply consume the slack created by guardrails, because consuming that slack (taking more risk, having more incidents) causes the guardrails to tighten.
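A rough sketch of how that recalibration might look in code. The tier names, reviewer counts, canary durations, and the “four clean weeks to promote” rule are all hypothetical; the only thing taken from the text is that trust expands with demonstrated reliability and contracts after an incident.

```python
from dataclasses import dataclass

# Hypothetical constraint tiers, strictest first.
TIERS = [
    {"name": "probation", "required_reviewers": 2, "canary_minutes": 60},
    {"name": "standard", "required_reviewers": 1, "canary_minutes": 30},
    {"name": "trusted", "required_reviewers": 1, "canary_minutes": 10},
]


@dataclass
class TeamTrust:
    tier: int = 0         # new teams start with the most friction
    clean_weeks: int = 0  # consecutive weeks without an incident

    def record_week(self, incidents: int, weeks_to_promote: int = 4) -> None:
        if incidents > 0:
            # Trust contracts: tighten one tier and reset the streak.
            self.tier = max(0, self.tier - 1)
            self.clean_weeks = 0
            return
        self.clean_weeks += 1
        if self.clean_weeks >= weeks_to_promote and self.tier < len(TIERS) - 1:
            # Trust expands only after sustained clean behaviour.
            self.tier += 1
            self.clean_weeks = 0

    @property
    def constraints(self) -> dict:
        return TIERS[self.tier]


team = TeamTrust()
for _ in range(4):
    team.record_week(incidents=0)
print(team.constraints["name"])  # relaxed one tier after four clean weeks
team.record_week(incidents=1)
print(team.constraints["name"])  # tightened again after an incident
```

Promotion is deliberately slower than demotion here. That asymmetry is a design choice of the sketch, not a rule.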

Fast feedback over prevention

If you can detect and recover from problems in minutes, you need less prevention. But the Peltzman effect still applies! People will ship more carelessly because recovery is easy.

This has parallels to the CI system - the feedback loop has to carry real signal. For example, a rollback should trigger a post-mortem, so it still registers, even if the pain was short-lived.

Coupling safety to social accountability

Tests are easy to ignore because they’re between you and the machine. But if your test failures are posted to a team channel, or your “saved by CI” rate is tracked, then it starts to matter more.

At a former workplace, we practiced social accountability by placing Kermit the frog on the desk of the person that broke the build (e.g. committed some syntactically invalid code). No-one wanted to be a muppet!1

Establishing mutual accountability for quality on a team is an important part of mitigating the Peltzman risks.

Closing

I’m definitely not arguing that safety systems are bad (some risk compensation is kind of the point!), but to be effective they’ve got to close the feedback loop so that the teams can calibrate risk levels. Error budgets make risk visible. Progressive trust makes consequences responsive. Social accountability makes near-misses public.

Looping back to cars, perhaps Peltzman-resistant devices for cars wouldn’t just save your life; they’d also tell you how close you came to losing it. An airbag that deploys with a disappointed sigh. A crash helmet that plays back your near misses on a little screen while you sit in the lay-by reconsidering your choices?

Having written that, it sounds a bit like workplace bullying. Definitely not recommending this as a modern practice!

Tests with real leverage

Jeff — Mon, 19 Jan 2026 07:25:10 GMT

Your git history knows where your bugs live. Your production logs know which invariants actually hold. Thousands of open-source repositories know which patterns lead to defects.

Your test suite rarely uses any of this.

That’s because most testing is open-loop: humans encode knowledge into tests, machines execute them, and the system doesn’t update its testing strategy based on what happens next. It doesn’t learn from incidents, real-world inputs, or the places you keep fixing.

As we produce ever more code (particularly AI-generated code) we need testing with greater leverage. The biggest leverage gains will come from closing these loops.

Current testing is (mostly) static

Consider the spectrum of testing approaches we have today:

Unit tests catch specific regressions, but each test covers exactly one behaviour. You write roughly as much test code as production code. This is low leverage.
Property-based tests are better - you write some invariants; the framework generates cases. This can cover much more than a unit test, so it’s higher leverage.
Static analysis covers your entire codebase at zero marginal cost. High leverage, but it knows nothing about your domain.
Formal verification offers mathematical proof of correctness. This is maximum leverage; in practice, however, the effort required makes it viable only for critical paths.

Most of these are write-once. You write tests or rules, then they sit there protecting your codebase from that particular set of changes. Even when the system runs for months and accumulates real evidence, the test suite mostly stays unchanged.
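For a feel of the leverage difference between a unit test and a property-based test, here’s a small sketch using only the standard library. The `apply_discount` function is a made-up example, and the hand-rolled random loop is a stand-in for a real framework such as Hypothesis, which also shrinks failing cases to a minimal example.

```python
import random


def apply_discount(price_cents: int, discount_percent: float) -> int:
    """Hypothetical function under test."""
    if not 0.0 <= discount_percent <= 100.0:
        raise ValueError("discount_percent must be between 0 and 100")
    return round(price_cents * (1 - discount_percent / 100.0))


def test_unit_example() -> None:
    # Unit test: one input, one expected output, one behaviour covered.
    assert apply_discount(1000, 25.0) == 750


def check_discount_property(trials: int = 1000, seed: int = 0) -> int:
    # Property: a discounted price is never negative and never above the
    # original price, for any valid input we can generate.
    rng = random.Random(seed)
    for _ in range(trials):
        price = rng.randint(0, 1_000_000)
        percent = rng.uniform(0.0, 100.0)
        result = apply_discount(price, percent)
        assert 0 <= result <= price, (price, percent, result)
    return trials


test_unit_example()
print(check_discount_property(), "generated cases passed")
```

One invariant, a thousand cases. Note that both are still write-once: nothing here changes based on what production later reveals.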

Some gaps emerge…

If we look at these types of tests, a few gaps stand out.

Weak feedback from production. Systems generate months of signal about real inputs and real failures. We sometimes turn incidents into regression tests (low leverage!), but it’s manual and slow, and rarely shapes the overall testing strategy.
Weak feedback from history. Version control tells you where fixes cluster. Some files get patched repeatedly; others never break. But most teams still spread testing effort by habit and intuition, not by defect density.
Little learning across codebases. Thousands of projects use the same libraries and repeat the same mistakes. That collective knowledge doesn’t automatically flow into your tests or analysers.

If we could turn those signals into automated action, we’d unlock a new set of high-leverage tests.
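To make the “feedback from history” signal concrete, here’s a rough sketch that ranks files by how often they appear in bug-fix commits. It assumes the log text comes from something like `git log --name-only --pretty=format:@@%s`, where the `@@` marker is my own convention for spotting subject lines, and it assumes fixes can be recognised from keywords in the commit subject, which is a crude heuristic.

```python
import re
from collections import Counter

# Crude heuristic: a commit is a "fix" if its subject mentions one of these.
FIX_PATTERN = re.compile(r"\b(fix|bug|hotfix|regression)\w*", re.IGNORECASE)


def fix_counts(log_text: str, marker: str = "@@") -> Counter:
    """Count how many fix commits touched each file path."""
    counts: Counter = Counter()
    in_fix_commit = False
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(marker):
            # A new commit starts; decide whether it looks like a fix.
            in_fix_commit = bool(FIX_PATTERN.search(line[len(marker):]))
        elif in_fix_commit:
            counts[line] += 1
    return counts


def rank_hotspots(log_text: str, top: int = 3) -> list:
    """Return the files most often touched by fix commits."""
    return [path for path, _ in fix_counts(log_text).most_common(top)]


SAMPLE_LOG = """\
@@Fix rounding in discount calc
src/pricing.py
src/cart.py

@@Add gift cards
src/giftcards.py

@@Bugfix: negative totals
src/pricing.py
"""

print(rank_hotspots(SAMPLE_LOG))  # pricing first, then cart
```

The ranking could then feed whatever decides where extra test generation or review effort goes. That last step is the interesting (and unsolved) part.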

What LLMs Make Possible (?)

These gaps have always existed. What’s changed is that LLMs both make closing them tractable and make the need to close them even stronger. Here are some things I’d like to see:

Semantic fuzzing. Current fuzzers generate syntactically valid but semantically random inputs. What if a fuzzer understood your domain? “Generate realistic customer orders that might break edge cases.”

History-weighted test generation. Analyse git blame and bug-fix commits. Identify code regions with high defect density. Generate more tests for those areas. The test effort concentrates where bugs actually occur. Tools like codescene.com might be a step in the right direction.

Production-guided property inference. Observe your system in production. Infer likely invariants (“this field is never null,” “this value is always positive”). Generate property tests to verify they hold. Instead of specifying properties, discover them.

Cross-codebase pattern mining. Analyse thousands of repos to find bug patterns. “In 73% of codebases using this library, this pattern leads to a bug.” Then generate analysers automatically.

Differential testing with LLMs. Generate multiple implementations. Analyse the inconsistencies.

Invariant monitoring that feeds back. Run invariant checks in production (cheaply, sampled). When violations occur, automatically generate regression tests. Production becomes a test generator (leverage improves because humans aren’t writing the tests).

A concrete picture: an invariant monitor sees discountPercent > 100.0 appear for the first time after a rollout; it captures the request shape, generates a shrinking repro, and commits a regression/property test that would have caught it pre-deploy.

The Closed-Loop Future

The pattern is simple: take information that exists but isn’t flowing into your tests and close the loop.

Production → Tests
Git history → Coverage priorities
Cross-org patterns → Analysers
Documentation → Properties

There are hard problems here (false invariants, privacy, flakiness, cost, and “who approves auto-generated tests?”). But those are design constraints, not complete deal-breakers: confidence thresholds, quarantining new tests, redaction/sandboxing, and measuring flake rate can keep the system safe.

We’ve been writing tests the same way for decades: humans encode knowledge, machines execute it. That has to change! The next generation of testing will have to have higher leverage. Closed-loop testing is a shift from authoring tests to operating a test system: it ingests signals, updates priorities, and continuously generates new checks. The metric isn’t ‘coverage’ (if it ever was!); it’s time-to-detection, recurrence rate, and cost per prevented incident.