Performance Engineering · 10 min read

The Coordinated Omission Problem: Why Your Load Test Is Lying to You

When the system under test stalls, most load generators quietly stop sending new requests — which means most of the latency the stall would have inflicted never gets recorded. Your p99 is a fantasy.

Picture a load test. Your tool is configured to send 1,000 requests per second. The system under test is humming along, reporting a beautiful 50 ms median, 120 ms p95, 180 ms p99. You screenshot the result and paste it into the release readiness doc. Everyone nods. You ship.

That p99 is a lie. Not by a little. Sometimes by orders of magnitude.

This is coordinated omission. Gil Tene has been shouting about it for over a decade, and it still shows up in load tests that cost companies millions of dollars. Let me explain what's happening and how to fix it.

How most load generators work

A typical load test runs a pool of virtual users (threads, goroutines, coroutines — pick your flavor). Each one executes a loop:

loop:
  send request
  wait for response
  record latency
  sleep briefly to maintain target rate

This looks sensible. Each virtual user sends requests in a steady cadence, waits for the response, records how long it took, and sleeps just long enough to hit the target rate.
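
For concreteness, here is that loop as a minimal Python sketch; the requests dependency and the URL are placeholders, not taken from any particular tool:

import time

import requests  # third-party HTTP client, used here purely for illustration

TARGET_INTERVAL = 1.0   # one request per second per virtual user
latencies = []

while True:
    start = time.monotonic()
    requests.get("http://target/endpoint")          # send request, wait for response
    elapsed = time.monotonic() - start
    latencies.append(elapsed)                        # record latency
    time.sleep(max(0.0, TARGET_INTERVAL - elapsed))  # sleep to maintain the target rate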

Here's the subtle problem. What happens when the system under test stalls?

The stall that doesn't exist

Say the target rate is one request per second per virtual user, and the system is normally responding in 50 ms. Then the system hits garbage collection and stalls for 10 seconds.

During that stall, your virtual user has exactly one request in-flight. It's waiting. When the stall ends and the response comes back, the user records a latency of 10 seconds.

Then — and this is where it gets ugly — the virtual user sends the next request and records its latency. That one returns in 50 ms. And the next. And the next.

How many requests should the virtual user have sent during that 10-second stall? At one per second: ten requests. Every one of those ten requests would have been affected by the stall. Every one would have experienced latency somewhere between 50 ms (for a request sent right at the end of the stall) and 10 seconds (for the first). On average, around 5 seconds of latency.

But the load generator didn't send those ten requests. It was waiting for the one in-flight request to complete. So those ten requests — the ones that would have shown the worst of the stall — never happened. They were coordinately omitted.

The tool recorded one 10-second result. It should have recorded something like eleven results averaging five seconds. Your latency distribution is missing almost all of the bad data, and the p99 you compute from what's left is a flattering fiction.
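
A back-of-the-envelope Python sketch of that scenario (one virtual user at one request per second, a 10-second stall, 50 ms normal latency; purely illustrative numbers):

stall_s, interval_s, normal_s = 10.0, 1.0, 0.05

recorded = [stall_s]   # the single in-flight request the naive tool actually measured
# ...plus the ten requests that should also have been sent during the stall
intended = [stall_s] + [max(stall_s - i * interval_s, normal_s) for i in range(1, 11)]

print(sum(recorded) / len(recorded))   # 10.0 s, from one lonely sample
print(sum(intended) / len(intended))   # ~5.0 s across eleven samples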

How bad is it?

Very. Tene's classic demonstration takes a system with a simple stall pattern and shows that the p99 reported by a naïve load generator can be two orders of magnitude better than the actual p99 a user would experience. A test that says "p99 is 200 ms" can be covering a reality where one user in a hundred is waiting several seconds.

The effect gets worse the more contention and coordination there is in the system — which is to say, worse for exactly the systems you most want to test.

The fix, in three parts

1. Constant-throughput generation, independent of response

The load generator should send requests at a fixed rate, regardless of whether previous requests have completed. If the target is 1,000 requests per second, the 500th millisecond produces the 500th request, full stop. If responses are slow, requests pile up in flight. They don't slow down the sender.

Tools that do this correctly: wrk2, k6 with the constant-arrival-rate executor, Gatling with constantUsersPerSec, Locust with constant_total_ips. Tools that historically got it wrong: the default JMeter thread model, basic wrk, most "closed model" generators.

The distinction is sometimes called open model vs. closed model. Open-model load — where the arrival rate is independent of system response — is what you want for capacity and regression testing. Closed-model load — fixed number of users looping — is what you want when you're simulating a specific user population whose size doesn't change. Most web-scale testing is open-model, even though closed-model is the default in most tools.
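
To make the open model concrete, here is a minimal sketch of a constant-rate sender in Python asyncio; the aiohttp dependency, the URL, and the numbers are illustrative assumptions, not a reference implementation of any of the tools above:

import asyncio
import time

import aiohttp  # assumed async HTTP client, for illustration only

async def timed_get(session, url, results):
    start = time.monotonic()
    async with session.get(url) as resp:
        await resp.read()
    results.append(time.monotonic() - start)   # latency from actual send to response

async def open_model(url, rate, duration):
    results, tasks = [], []
    interval = 1.0 / rate
    async with aiohttp.ClientSession() as session:
        t0 = time.monotonic()
        for i in range(int(rate * duration)):
            # Sleep until this request's scheduled send time, driven by the clock.
            # If the target stalls, in-flight requests pile up; the schedule never slips.
            await asyncio.sleep(max(0.0, t0 + i * interval - time.monotonic()))
            tasks.append(asyncio.create_task(timed_get(session, url, results)))
        await asyncio.gather(*tasks)   # wait for stragglers before closing the session
    return results

# e.g. asyncio.run(open_model("http://target/endpoint", rate=1000, duration=60))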

2. Correct for coordinated omission in recording

Even with constant throughput generation, there's a subtler form of the problem. If you record latency naively — "response received - request sent" — you record the latency of a request from its own perspective. But the user's perspective is different: they experienced the latency from the moment they intended to send the request, which is the scheduled send time, not the actual send time.

If the tool queues requests because threads are busy, the queue time is part of the user's perceived latency. HdrHistogram has a built-in correction — recordValueWithExpectedInterval — that inserts synthetic samples for the gap between intended and actual sends. The mental model: with an expected interval of one second, if a request should have been sent at time T but actually went at T+2s, the histogram pretends it saw two samples with 2s and 1s of latency respectively, representing the experience of users who scheduled arrivals during that gap.

This is the "corrected by intended rate" recording that the good tools build in. The bad ones don't, and quietly lie.

3. Use HdrHistogram (or t-digest) for percentile math

The third piece isn't about omission directly, but it falls in the same trap: most percentile computation breaks when the distribution has long tails.

Naïve percentile computation usually involves sorting all measurements, which is fine for small runs but infeasible for long tests. The common alternative — keeping bucketed counters — loses tail precision precisely where you care about it. HdrHistogram and t-digest are algorithms specifically designed to keep a small, fixed-memory summary that maintains high precision across the whole range, including the p99.9 and p99.99 tail.

If you're ever tempted to compute a p99 by averaging p99s, see the earlier post on percentiles — short version: you can't. Merge the histograms, then compute the percentile.
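
For example, a sketch of combining per-worker results the right way, again assuming the Python HdrHistogram port and its add method:

from hdrh.histogram import HdrHistogram

def merged_p99(per_worker_histograms):
    combined = HdrHistogram(1, 60_000, 3)
    for h in per_worker_histograms:
        combined.add(h)   # merge the raw counts, never the precomputed percentiles
    return combined.get_value_at_percentile(99.0)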

The test environment trap

One more thing. Even a correctly configured load generator can produce bad results if the generator itself is the bottleneck. A single-machine load generator driving a production-shaped system will eventually run out of open sockets, CPU, or ephemeral ports, and start producing its own stalls. Those stalls show up in your results as if they were the system's — and they're even worse than coordinated omission because they're your tool's fault.

Checks before you trust any load-test result:

  • The generator is not saturated. CPU below 70%, sockets below the ulimit, network below 60% of interface capacity.
  • Generator and target are on the same network. Cross-region or cross-datacenter generation adds noise that has nothing to do with the system under test.
  • The test was warm. Discard the first minute or two of any run. JIT, caches, and connection pools make the cold start unrepresentative.

Validating your own tool

Want to know if your load-testing setup has a coordinated omission problem? Run this experiment.

  1. Stand up a trivial service that sleeps 50 ms and returns 200.
  2. Every 30 seconds, make it sleep 10 seconds instead. (A simple global flag, toggled on a timer, is enough.)
  3. Run your load test at, say, 100 requests per second for 5 minutes.
  4. Look at the p99 your tool reports.

A correct tool will report a p99 somewhere near 10 seconds — the stall dominates enough of the traffic during the stall windows that the top 1% of latencies should reflect it. A broken tool will report a p99 around 50-100 ms, because the stalls were coordinately omitted and only the happy-path samples made it into the recording.
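
For reference, here is one minimal way to build the stall service from steps 1 and 2, sketched with Flask and a background timer thread; the framework, port, and exact timings are arbitrary choices:

import threading
import time

from flask import Flask  # any web framework works; Flask keeps the sketch short

app = Flask(__name__)
stalled = False

def toggle_stall():
    # 30-second cycle: 20 s of normal 50 ms responses, then a 10 s stall window
    global stalled
    while True:
        time.sleep(20)
        stalled = True
        time.sleep(10)
        stalled = False

@app.route("/")
def handler():
    time.sleep(10 if stalled else 0.05)   # 10 s during the stall, 50 ms otherwise
    return "ok", 200

threading.Thread(target=toggle_stall, daemon=True).start()
app.run(port=8080, threaded=True)   # threaded so healthy requests are not serialized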

This five-minute experiment has shaken more than one "we've been doing load testing for years" program. It's a cheap diagnostic for a problem that's usually invisible.

Why this matters beyond load testing

Coordinated omission isn't just a load-testing bug. The same pattern shows up in production monitoring. If your metrics client drops samples when the pipeline is backed up, your latency percentiles understate the tail during exactly the periods when the tail is most interesting. If your application times out and retries without recording the original request's latency, you've coordinately omitted your own production data.

The general lesson: whenever a measurement system backs off under stress, you should assume it's hiding the part of the distribution you most want to see. Check whether it does. Fix it if it does.

What to do this week

  1. Find out what your load generator does when responses slow down. Read the docs. If it says anything like "each virtual user waits for response before sending next request," you have coordinated omission and need to either switch tools or switch modes.
  2. Run the 10-second stall experiment above. Calibrate your tool against a known-bad system. You'll learn more in ten minutes than most people learn about their load tests in a year.
  3. Start recording with HdrHistogram and report p99, p99.9, and the max. Stop reporting averages. The tail is where the users live.

If a load-test result looks too clean, it probably is. The system isn't that quiet. The tool is just polite.
