Resiliency Engineering · 9 min read

Chaos Engineering Is Not About Breaking Things

The word "chaos" has done more damage to the discipline than any outage ever did. If you can't write the hypothesis down, you're not doing chaos engineering — you're doing vandalism in a GameDay hat.

The worst thing that ever happened to chaos engineering was the word "chaos."

It conjured a particular image: a rogue engineer with a red button, killing production pods for fun, cackling while the on-call pages light up. For years I've watched that image shape what people think the practice is. It's no wonder a lot of leaders reflexively refuse to let it anywhere near production. The marketing was bad.

Chaos engineering isn't about breaking things. It's a disciplined practice of forming hypotheses about system behavior under stress and running controlled experiments to validate them. The word we should've used from the start is "experimental."

The shape of a real experiment

A real chaos experiment has four parts, and if you can't fill in all four, you shouldn't be running it.

1. The hypothesis

A statement about what you expect. "If the payment service's latency degrades to 2 seconds for 10% of calls, the checkout flow will continue to complete successfully, with an average latency increase of less than 300 milliseconds, and no elevated error rate."

That's a real hypothesis. Notice that it's specific, it's measurable, and — crucially — you can be wrong. If the hypothesis can't be disproven by the experiment, the experiment has no scientific content.
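
To make "falsifiable" concrete, here's a minimal sketch of that same hypothesis written down as something a script can evaluate after the run, rather than a sentence in a doc. The class and field names are illustrative, not part of any particular chaos tool, and the error-rate bound is an assumed number standing in for "no elevated error rate."

  from dataclasses import dataclass

  @dataclass
  class Hypothesis:
      # A falsifiable statement: expected behavior under a named fault.
      fault: str                       # what gets injected
      max_latency_increase_ms: float   # allowed increase in average checkout latency
      max_error_rate: float            # allowed checkout error rate

      def holds(self, baseline_latency_ms: float,
                observed_latency_ms: float,
                observed_error_rate: float) -> bool:
          # True only if the observed behavior stayed inside the stated bounds.
          return (observed_latency_ms - baseline_latency_ms <= self.max_latency_increase_ms
                  and observed_error_rate <= self.max_error_rate)

  # The hypothesis from the text, written down before the experiment runs.
  checkout_under_slow_payments = Hypothesis(
      fault="2s latency on 10% of payment-service calls",
      max_latency_increase_ms=300,
      max_error_rate=0.01,  # assumed number; "no elevated error rate" still needs one
  )

If holds() can only ever come back True, the experiment has no scientific content. The whole point of writing it this way is that it can come back False.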

2. The blast radius

What does this experiment affect, and what does it not affect? This is the safety half of the work. You explicitly enumerate: "We are injecting 2-second latency into 5% of calls from the staging-payment service to the fraud-check service, for 10 minutes, in the us-east-1 region, between 10 AM and 10:10 AM on Tuesday."

The smaller and more specific the blast radius, the safer the experiment. Real chaos programs start at 1% of a single service in staging and grow — slowly — from there.
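
One way to keep that enumeration honest is to record it as data that gets logged alongside the run, not just prose in a planning doc. A minimal sketch using the example above; every name and value here is illustrative.

  # Blast radius, enumerated explicitly before the run.
  # Anything not listed here is out of scope for this experiment.
  blast_radius = {
      "source_service": "staging-payment",
      "target_service": "fraud-check",
      "fault": {"type": "latency", "delay_ms": 2000},
      "traffic_fraction": 0.05,          # 5% of calls
      "region": "us-east-1",
      "duration_minutes": 10,
      "window": "Tuesday 10:00-10:10",
      "environment": "staging",
  }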

3. The abort condition

What happens if you were wrong? You need a pre-committed trigger for stopping the experiment. Usually a specific metric crossing a specific threshold: "If the checkout error rate exceeds 2%, or if the p99 exceeds 4 seconds, we halt the experiment automatically."

Abort conditions aren't optional. They're the only thing separating an experiment from an incident. If you don't have one, you are not running an experiment. You are gambling.
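
"Pre-committed" is easiest to honor when the abort is enforced by a watchdog rather than a person watching a dashboard. A minimal sketch, assuming a get_metric() function that reads current values from your monitoring system and a stop_experiment() hook that reverts the fault injection; both are placeholders, not real APIs.

  import time

  def run_with_abort(get_metric, stop_experiment,
                     max_error_rate=0.02, max_p99_seconds=4.0,
                     duration_seconds=600, poll_seconds=10):
      # Halt the experiment the moment either abort threshold is crossed.
      deadline = time.monotonic() + duration_seconds
      while time.monotonic() < deadline:
          if (get_metric("checkout_error_rate") > max_error_rate
                  or get_metric("checkout_latency_p99_seconds") > max_p99_seconds):
              stop_experiment()
              return "aborted"    # the hypothesis was wrong; that's a result too
          time.sleep(poll_seconds)
      return "completed"          # ran the full window without tripping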

4. The observation plan

What are you going to measure, from where, and how will you decide whether the hypothesis held? Before the experiment starts, the dashboards are open, the relevant metrics have known baselines, and there's an explicit "this is what success looks like" statement. After the experiment, you write up the result and either confirm the hypothesis or don't.

"We ran the experiment and nothing bad happened" is not an observation plan. It's a sigh of relief.

What you're actually testing

Here's the insight that took me a while to internalize: the failures you inject are not really the point. The point is whether your mental model of the system matches reality.

When you write the hypothesis "the checkout flow will complete successfully if the fraud service is slow," you're writing down what you believe. When you run the experiment, you find out if you were right. The value isn't in breaking the fraud service — the value is in the moments when you thought you were right and you weren't.

Every incident is, at bottom, a divergence between your model of the system and its actual behavior. Chaos engineering is a way to surface those divergences on purpose, in safe conditions, rather than letting them ambush you at 2 AM.

Where to run it

A common question: "Should we run chaos experiments in production?"

Eventually, yes. Not on day one.

Here's a progression that works:

  1. Local and test environments. Start with injecting faults into integration tests. Run with pumba, Toxiproxy, or your library of choice (a minimal Toxiproxy sketch follows this list). Build the muscle of "experiment" before the stakes are real.
  2. Staging with synthetic traffic. Generate a steady workload, inject faults, observe. This is where most of the early learning happens — and where you catch the obvious gaps in your retry/timeout/breaker design.
  3. Production with blast-radius controls. Start at 1% of traffic, one service, one fault type. Scale up only when you've run several successful experiments and the observation discipline is solid.
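
For the first step, here's a concrete flavor of what "inject faults into integration tests" can look like: driving Toxiproxy's HTTP control API from Python to put 2 seconds of latency in front of a dependency. The proxy name, addresses, and upstream are made up, and the API fields are quoted from memory, so check them against the Toxiproxy docs for your version.

  import requests

  TOXIPROXY = "http://localhost:8474"  # Toxiproxy's default control-API port

  # Route the test's fraud-check traffic through a proxy we control.
  requests.post(f"{TOXIPROXY}/proxies", json={
      "name": "fraud_check",
      "listen": "127.0.0.1:21212",                      # the test client points here
      "upstream": "fraud-check.staging.internal:8443",  # illustrative upstream
  }).raise_for_status()

  # Inject the fault from the hypothesis: 2 seconds of latency on responses.
  requests.post(f"{TOXIPROXY}/proxies/fraud_check/toxics", json={
      "type": "latency",
      "stream": "downstream",
      "attributes": {"latency": 2000, "jitter": 0},     # milliseconds
  }).raise_for_status()

  # ...run the checkout integration tests against 127.0.0.1:21212, check the
  # hypothesis, then DELETE /proxies/fraud_check to clean up.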

The reason to eventually reach production is simple: staging is a lie. It has synthetic traffic patterns, different data shapes, different cache hit ratios, different dependency latencies, often different service versions. Experiments in staging validate that your staging behaves the way you expect. Only production experiments validate that production does.

GameDays: the on-ramp

If your organization isn't ready for automated chaos experiments, start with GameDays. A GameDay is a scheduled, live, hands-on exercise where the team injects a failure into a non-critical environment and practices the response. Announced in advance. Observed by everyone. Followed by a retrospective.

GameDays are less about finding technical bugs (though they do that too) and more about exposing the human system around the technical one. Who's on-call? Does the runbook work? Does the alert fire? Is the escalation path clear? Does the war room tool work? These are the things that matter at 3 AM, and they can only be tested under something like real conditions.

I consider a team ready for automated chaos engineering only after they've run a dozen successful GameDays. Skipping the GameDay phase and going straight to automated fault injection is how chaos engineering programs become cautionary tales.

The prerequisites people don't talk about

Chaos engineering presumes things that are often missing.

  • Working observability. If you can't see what the system is doing in normal conditions, you won't be able to see what's different under stress. Fix observability before you start injecting faults.
  • Defined SLOs. The hypothesis and the abort condition both rely on knowing what "good" looks like. Without SLOs, you're guessing whether the system is degrading.
  • A mature incident response process. If a real incident during the experiment would catch the team off guard — wrong people paged, runbook missing, escalation unclear — the experiment itself could become the incident.

Chaos engineering is often what teams reach for when they feel like they don't have enough reliability. It's actually something you do after you have the observability and SLO foundations. Doing it before those exist is putting a spoiler on a car whose engine doesn't start.

What to do this quarter

  1. Run one GameDay against a non-critical service. Pick a failure mode you think you've designed for. Invite the on-call engineer, the service owner, and a skeptic. Write down the hypothesis. See if you were right.
  2. Document the result in whatever form your team reads — a retro doc, a Slack thread, a one-pager. The knowledge dies if it doesn't make it out of the room.
  3. Fix one thing that the experiment surfaced, and nothing more. The temptation will be to fix everything. Resist. A chaos program that produces one solid improvement per quarter is better than one that produces a long backlog of unfinished fixes.

Chaos engineering at its best looks boring from the outside. No cackling, no red buttons, no dramatic outages. Just a team methodically checking their mental model against reality and adjusting one or the other when they diverge. That's the practice. Everything else is theatre.

#chaos-engineering #reliability #experiments #gameday
