Exploration vs. Exploitation

Multi-armed bandits!

Oct 14, 2025

You’ve gone to Las Vegas. You brave the gambling floor and quickly pick a slot machine. You start winning (and not just free drinks!). Do you keep pulling the lever on this machine, or try something else that might have a better reward?

This is the multi-armed bandit problem: balancing exploitation (keep playing the best-looking machine so far) against exploration (try others to learn if something beats it). Exploitation can lock you into a sub-optimal arm; pure exploration wastes resources learning what you won’t use. The optimal balance depends heavily on your time horizon and how fast the environment changes.

Why Companies Systematically Under-Explore

As companies grow, they typically under-explore because structures that amplified prior success now bias towards repeating it.

Revenue cycles are quarterly - exploitation shows results in Q3, exploration might pay off in 2027
Career cycles are 2-3 years - leaders may rotate before exploration pays dividends
Investor expectations are immediate - markets punish exploration that depresses short-term growth
Promotion incentives favour shipping - you get promoted for features delivered, not options preserved
Risk is asymmetric - failed exploitation is defensible (“we tried to ship”), failed exploration is career-limiting (“you wasted 6 months on nothing”)

The end result is a system where folks rationally choose exploitation because the benefits of exploration accrue to future-you (or more likely, someone else entirely). Organizations optimize for the decision-makers’ time horizons, rather than the needs of the company.

The exploration deficit is structural, not a failure of vision.

The Regret Minimization Framework

In classic bandits we minimize cumulative regret. Translating this to organizations gives us two modes of regret.

“We should have explored” - Delayed and silent (e.g. competitors shipping what you didn’t explore).
“We should have shipped” - Immediate and loud (e.g. customers complaining, or revenue targets missed).
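For the classic (slot machine) version, cumulative regret is just the payout you gave up by not always pulling the best arm. Here’s a minimal sketch; the payout rates and the two scripted players are made up purely for illustration, and in real life you don’t get to see `true_means`.

```python
def cumulative_regret(true_means, choices):
    """Sum, over every pull, of the gap between the best arm's mean
    payout and the mean payout of the arm that was actually chosen."""
    best = max(true_means)
    return sum(best - true_means[arm] for arm in choices)


# Hypothetical payout rates for three machines; arm 2 is the best one.
true_means = [0.2, 0.5, 0.7]

# A player who settles on arm 1 and never looks elsewhere.
stuck_on_arm_1 = [1] * 100
# A player who tries each arm 10 times, then commits to arm 2.
explore_then_commit = [0] * 10 + [1] * 10 + [2] * 80

print(cumulative_regret(true_means, stuck_on_arm_1))
print(cumulative_regret(true_means, explore_then_commit))
```

The stuck player racks up roughly 20 units of regret and the explorer roughly 7: exploring costs something up front, but far less than never finding the better machine.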

Over time, good policies help you shift towards higher payoffs without ever turning exploration off. In non-stationary markets (like software!), you have to keep a non-zero exploration rate permanently.

How much to explore? There’s obviously no one-size-fits-all answer, but some factors to consider include:

How much you know about the domain (more uncertainty means more exploration)
How quickly the environment changes (slower change, more exploitation)
How long you are playing (longer horizons mean more room for exploration)

I’d argue that most software engineering (at least for product companies) sits in a high-uncertainty, fast-changing, long-horizon space. The optimal rate for exploration is almost certainly much higher than most companies allocate. I’d guess that 20% time, as Google originally conceived it, was in the right ballpark for early-stage work in a fast-moving, uncertain environment.

Bandit algorithms like ε-greedy or Upper Confidence Bound (UCB) offer conceptual parallels to possible innovation strategies:

ε-greedy - allocate a small, fixed percentage (say 10%) of capacity to explore new bets.
Upper Confidence Bound (UCB) - make staged bets. Use small teams with short gates for exploration, and scale investment if the signal is good or the uncertainty remains high.
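To make ε-greedy concrete, here’s a small simulation sketch. Everything in it is an assumption for illustration: the function name, the three payout rates, and the simplification that each pull pays out 1 or 0.

```python
import random


def epsilon_greedy(true_means, epsilon, rounds, seed=0):
    """Play a Bernoulli bandit with a fixed exploration rate epsilon.

    Returns (pulls per arm, estimated mean per arm, total reward).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    pulls = [0] * n_arms
    estimates = [0.0] * n_arms
    total_reward = 0.0
    for _ in range(rounds):
        if rng.random() < epsilon:
            # Explore: pick any arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the best-looking arm so far (ties go to the first).
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        pulls[arm] += 1
        # Incremental update of the running average for this arm.
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
        total_reward += reward
    return pulls, estimates, total_reward


# Hypothetical payout rates; the first machine you sit at is the worst one.
true_means = [0.2, 0.5, 0.7]

greedy_pulls, _, _ = epsilon_greedy(true_means, epsilon=0.0, rounds=1000)
mixed_pulls, _, _ = epsilon_greedy(true_means, epsilon=0.1, rounds=1000)
print("pure exploitation:", greedy_pulls)
print("10% exploration:  ", mixed_pulls)
```

With `epsilon=0.0` the player never leaves the first machine, because arms that have never been tried keep an estimate of zero. That’s the lock-in problem in miniature. With 10% exploration every arm gets sampled, so the better machines at least have a chance of being found.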

Force Exploration Through Institutional Structure

Create an Innovation team (ε-greedy)

Create an organizational unit with an explicit mandate to learn and create options. Think “unruly pirates”, not a delivery team!

Freedom to pick tech for the job (org capability is not a constraint)
Measured on learning, options created and decisions made, not velocity or roadmap burn!
Protected from roadmap items (this ain’t a delivery team)

Exploration is about testing hypotheses, learning and creating options. Sometimes success is measured by failure!

Time-Boxed Exploration Sprints (ε-greedy)

A time-boxed hack week can be a great way of exploring ideas you normally wouldn’t. In the past, Redgate’s Down Tools Week has been a great way of fuelling innovation, and the idea of a FedEx day to deliver innovation has been taken up by many other companies.

These events give everyone permission to explore and make this mode of thinking the default (even if only briefly!). Hack weeks work because they remove the coordination cost of “should I explore or should I ship?” and give you the answer!

Thinking in terms of bets (UCB)

Ask your teams to think in terms of bets with staged funding. Start tiny, measure early signals, and scale or kill fast. Treat each bet as a real option; small investments buy information and the right (not the obligation) to scale when uncertainty resolves.

In practice, this means:

Start new initiatives with small teams (2-3 engineers) and short timeboxes
Measure early signals: user engagement, technical feasibility, team enthusiasm
Double down on projects that show promising metrics OR where the opportunity space remains unclear but large
Kill projects that show both poor results AND high certainty (you’ve learned they won’t work)
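The scale-or-kill rule above can be sketched as a tiny decision function. To be clear, the 0-to-1 scores, the thresholds and the “hold” outcome are all my own assumptions for illustration; the list above doesn’t say what to do with a bet that is still unclear but small.

```python
from dataclasses import dataclass


@dataclass
class Bet:
    name: str
    signal: float       # strength of early results, 0 (poor) to 1 (great)
    uncertainty: float  # how much is still unknown, 0 (certain) to 1 (no idea)
    opportunity: float  # rough size of the prize, 0 (small) to 1 (large)


def next_step(bet, good_signal=0.6, high_uncertainty=0.5, large_opportunity=0.5):
    """Return 'scale', 'kill' or 'hold' at a funding gate (thresholds are illustrative)."""
    if bet.signal >= good_signal:
        return "scale"  # promising metrics
    if bet.uncertainty >= high_uncertainty and bet.opportunity >= large_opportunity:
        return "scale"  # still unclear, but the space is large
    if bet.uncertainty < high_uncertainty:
        return "kill"   # poor results AND high certainty
    return "hold"       # unclear and small: an assumption, not covered above


bets = [
    Bet("promising prototype", signal=0.8, uncertainty=0.3, opportunity=0.4),
    Bet("murky but huge", signal=0.3, uncertainty=0.9, opportunity=0.9),
    Bet("tested and flat", signal=0.2, uncertainty=0.1, opportunity=0.7),
]
for bet in bets:
    print(bet.name, "->", next_step(bet))
```

The interesting case is the middle one: mediocre early results still get funding because so little has been learned about a large space.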

Amazon’s “working backwards” process embodies this approach. Teams write a press release and FAQ before building anything, then get incremental funding based on customer feedback and early prototypes.

UCB tells you to keep exploring high-uncertainty spaces even if early results are mediocre, because the potential upside combined with your uncertainty makes them valuable to investigate.
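Here’s roughly how that plays out in the standard UCB1 scoring rule, which adds an uncertainty bonus to each arm’s average reward. The two “bets” and their numbers are hypothetical.

```python
import math


def ucb1_score(mean_reward, pulls, total_pulls):
    """UCB1: average reward plus a bonus that shrinks as an arm is tried more."""
    if pulls == 0:
        return math.inf  # never-tried arms always get a look first
    return mean_reward + math.sqrt(2.0 * math.log(total_pulls) / pulls)


total_pulls = 100
# A well-understood product line: decent results, lots of evidence.
proven = ucb1_score(mean_reward=0.5, pulls=90, total_pulls=total_pulls)
# A new bet: mediocre early results, but very little evidence so far.
new_bet = ucb1_score(mean_reward=0.4, pulls=10, total_pulls=total_pulls)

print(f"proven:  {proven:.2f}")
print(f"new bet: {new_bet:.2f}")
```

The new bet scores higher despite its weaker average, because ten data points leave a lot of room for it to be better than it looks. As evidence accumulates the bonus shrinks, and the bet has to earn its place on results alone.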

But humans aren’t slot machines…

This analogy to algorithms breaks down because of pesky humans. Humans have switching costs. Moving from exploitation to exploration requires a change in thinking, with different risk tolerances, different time horizons and different success criteria.

It might be that not all humans are suited to all types of work. Simon Wardley explores this in “On Pioneers, Settlers, Town Planners and Theft”. If you ask an engineer to flip between the two modes of working, you’re likely to create the worst of both worlds. It takes time to settle into either mode of operation, so avoid context-switching!

Governance for exploration

Exploration is measured differently than exploitation. The measures are (as always) context-dependent, but broadly I’d put them into two categories:

Is the innovation engine working? You might get insight by looking at “time to confidence” or “kill rate by gate”. A healthy innovation engine is always improving confidence or killing ideas.
Is it sharing valuable insights? How long does it take for new evidence and learning to be reflected in organizational plans?

The tl;dr

Building software is an infinite game, but all the (default) forces point towards immediacy, and this structural tension guarantees a bias towards exploitation. Shouting “be more innovative” doesn’t help!

Instead, institutionalize exploration:

Commit ε = 15–20% exploration budget (decay by team as knowledge climbs).
Allocate via staged bets (bets buy information, not obligation).
Measure options created, time-to-confidence and kill rate, not velocity.
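One way to picture the decaying budget from the first point, as a sketch only: the linear shape, the `knowledge` score and the 5% floor are my assumptions, with the floor there because a changing market never lets exploration drop to zero.

```python
def exploration_budget(knowledge, start=0.20, floor=0.05):
    """Share of a team's capacity to spend exploring.

    knowledge: 0.0 (brand-new domain) to 1.0 (well understood); a made-up score.
    Decays linearly from `start` down to `floor`, never below it.
    """
    if not 0.0 <= knowledge <= 1.0:
        raise ValueError("knowledge must be between 0 and 1")
    return floor + (start - floor) * (1.0 - knowledge)


for knowledge in (0.0, 0.5, 1.0):
    print(f"knowledge={knowledge:.1f} -> explore {exploration_budget(knowledge):.1%}")
```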

Focusing solely on shipping sacrifices the optionality that creates long-term value.

JoT
