Performance Engineering · 7 min read

Beyond the Average: Why Percentiles Matter More Than Mean Response Times

Averages hide the pain your users actually feel. Here's why every serious performance engineer reports in percentiles — and how to pick the right one.

Somewhere on a dashboard right now, an average response time is glowing green. The team is happy. Leadership is happy. And somewhere else — on a real user's laptop, on a flaky Wi-Fi network, on the third page of a checkout flow — the application is taking eleven seconds to respond.

The average doesn't care. The average is a liar, and it's one of the most persistent lies in performance engineering.

The arithmetic of self-deception

Here's the problem with the mean. Imagine a hundred requests. Ninety of them return in 200 ms. Ten of them return in 4 seconds. The average is 580 ms. That number is technically correct and practically useless. It describes a population that doesn't exist: nobody got a 580 ms response. Ninety people got a fast one. Ten people got an infuriating one. The average just smeared the two groups into an imaginary middle.
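
To make the self-deception concrete, here's that exact example in a few lines of Python (numpy used only as a percentile helper):

```python
import numpy as np

# The example above: 90 requests at 200 ms, 10 requests at 4,000 ms
latencies = np.array([200] * 90 + [4000] * 10)

print(latencies.mean())              # 580.0 — a latency nobody experienced
print(np.percentile(latencies, 50))  # 200.0 — what a typical user saw
print(np.percentile(latencies, 95))  # 4000.0 — what the unlucky tail saw
```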

The mean is arithmetic's way of being polite. It rounds off the edges and hands you a comfortable number. Comfort is exactly the wrong thing to optimize for when you're trying to find out where a system hurts.

What percentiles actually tell you

A percentile answers a specific, useful question. p95 means: "what's the slowest response the fastest 95% of users got?" Put another way — if you line every request up from fastest to slowest, the request at the 95th-out-of-100 position is your p95.

The beauty of this is that it doesn't lie about the distribution. If your p50 is 180 ms and your p95 is 4 seconds, you instantly know two things: most of your traffic is fine, and a real, quantifiable chunk of users is having a bad time. The gap between p50 and p95 is the shape of pain.
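
Mechanically, a percentile is nothing more than an order statistic: sort and index. A minimal sketch of the nearest-rank flavor (real tools interpolate slightly differently at the edges):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: sort the samples, then index in."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-based rank
    return ordered[rank - 1]

# 100 requests lined up fastest to slowest: position 95 is your p95
print(percentile(range(1, 101), 95))  # 95
```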

I care about three percentiles on every system I work on:

  • p50 (median) — the typical experience. Tells me what a representative user sees. Moves when broad changes happen.
  • p95 — the edge of "normal." If p95 is bad, a non-trivial slice of real users is suffering. This is the number product managers should be staring at.
  • p99 — the tail. This is where cold caches, garbage collection pauses, connection churn, and retry storms live. It tells you how ugly your worst case actually is.

"But averages are easier to reason about"

I hear this a lot. It isn't true. Averages are easier to compute. They are not easier to reason about, because they describe a fiction. What people actually mean is "I grew up on averages and I trust them." That's a habit, not an argument.

Percentiles take a little getting used to, but they pay you back immediately. The first time you watch your p50 stay flat while your p99 doubles, you've caught a real regression that an average would have hidden for weeks.
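
Here's a simulated sketch of exactly that failure mode — synthetic lognormal latencies, not real traffic — where a 1% pathological path leaves p50 untouched while the tail blows out:

```python
import numpy as np

rng = np.random.default_rng(42)
baseline = rng.lognormal(mean=5.2, sigma=0.3, size=100_000)  # median ≈ 180 ms

# Simulated regression: 1% of requests now hit a pathological path
regressed = baseline.copy()
slow = rng.choice(regressed.size, size=1_000, replace=False)
regressed[slow] = rng.uniform(2_000, 6_000, size=1_000)

for name, data in (("baseline ", baseline), ("regressed", regressed)):
    print(name,
          f"p50={np.percentile(data, 50):6.0f}",   # barely moves
          f"p99={np.percentile(data, 99):6.0f}",   # roughly doubles
          f"mean={data.mean():6.0f}")              # shifts ~20% — easy to shrug off
```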

The percentile you pick is a product decision

Here's a subtlety that often gets skipped. Choosing between p95, p99, and p99.9 isn't a technical decision — it's a decision about how many unhappy users you're willing to tolerate.

Do the math. If you serve 10 million requests a day and you set your SLO at "p99 < 500 ms," you've just said "100,000 requests a day can be slower than 500 ms and that's fine." That's 700,000 slow requests a week. For an e-commerce flow, that might be unacceptable. For a nightly internal batch tool, it's probably overkill.
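
The budget math is one line, with the hypothetical volumes from above:

```python
daily_requests = 10_000_000
slo_percentile = 99

slow_budget_per_day = daily_requests * (100 - slo_percentile) // 100
print(slow_budget_per_day)       # 100,000 requests/day may breach 500 ms
print(slow_budget_per_day * 7)   # 700,000 per week
```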

The right percentile depends on the blast radius of a slow request. Checkout? Push to p99.9. Analytics export? p95 is plenty. The SLO conversation is a product conversation dressed in engineering clothes.

The trap of averaging percentiles

This one catches people constantly. You cannot average p95s.

If service A has a p95 of 100 ms and service B has a p95 of 300 ms, the p95 of their pooled traffic is not 200 ms. Depending on the traffic mix and the shape of each tail, it can land anywhere between the two numbers — and if the services sit in series on the same request path, the end-to-end p95 can be worse than either. Percentiles are order statistics; they don't add.
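
A quick numerical demonstration, with synthetic distributions shaped only so that A's p95 lands near 100 ms and B's near 300 ms:

```python
import numpy as np

rng = np.random.default_rng(7)
# 95% of each service's traffic in a tight band, 5% in a long tail
a = np.r_[rng.uniform(20, 100, 950), rng.uniform(100, 2_000, 50)]   # p95 ≈ 100 ms
b = np.r_[rng.uniform(80, 300, 950), rng.uniform(300, 5_000, 50)]   # p95 ≈ 300 ms

naive = (np.percentile(a, 95) + np.percentile(b, 95)) / 2
pooled = np.percentile(np.r_[a, b], 95)

print(f"averaged p95s: {naive:6.0f} ms")   # ≈ 200 ms — a fiction
print(f"pooled p95:    {pooled:6.0f} ms")  # close to 300 ms — what users see
```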

The same trap shows up when dashboards aggregate percentiles over time. A p95 over the last 24 hours is not the mean of 24 hourly p95s. If your observability tool is doing this quietly in the background, you're looking at a number that feels precise and is, in fact, roughly made up.

The fix: compute percentiles from the underlying histogram, not from pre-aggregated percentile values. Tools like HdrHistogram and t-digest exist specifically so you can merge distributions without losing the tail. Use them.
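
A sketch of the histogram-merge approach, assuming the Python hdrhistogram package (the import path and method names come from that library; treat them as an assumption if you're on a different binding):

```python
from hdrh.histogram import HdrHistogram

# One histogram per node or time bucket: track 1 ms .. 60 s, 3 sig. figures
shard_a = HdrHistogram(1, 60_000, 3)
shard_b = HdrHistogram(1, 60_000, 3)

for ms in (42, 180, 95, 4_200):        # toy latencies for shard A
    shard_a.record_value(ms)
for ms in (310, 120, 8_000, 95):       # toy latencies for shard B
    shard_b.record_value(ms)

# Merge the full distributions, then ask for the percentile — never the reverse
merged = HdrHistogram(1, 60_000, 3)
merged.add(shard_a)
merged.add(shard_b)
print(merged.get_value_at_percentile(95))
```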

What to do Monday morning

Three practical changes, in order of how much leverage they buy:

  1. Replace every "avg response time" panel on your dashboards with p50, p95, and p99 on the same chart. Keep the old average for a sprint if you must, for comparison. Then delete it.
  2. Write your SLOs in percentiles, with an explicit traffic volume. "p99 < 400 ms over a 30-day window" is a claim you can be held to — and, as the sketch after this list shows, a check you can actually run. "Average response time under 300 ms" is a vibe.
  3. Check how your tooling aggregates. Ask your vendor — or read the docs — and find out whether long-range percentiles come from merged histograms or from averaged percentiles. If it's the second one, your tail numbers are decorative.
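
For item 2, an SLO written in percentiles reduces to a mechanical check. A minimal sketch, with the threshold and window as assumptions matching the example wording above:

```python
import numpy as np

def slo_met(window_latencies_ms, percentile=99, threshold_ms=400):
    """Check 'p99 < 400 ms' over a whole window of raw samples.

    Compute the percentile across the full window (or a merged
    histogram) — never by averaging per-day percentiles.
    """
    return np.percentile(window_latencies_ms, percentile) < threshold_ms
```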

None of this is new. Gil Tene has been shouting about it for a decade. Every serious performance team I've worked with has made the jump. And yet "average response time" stubbornly clings to dashboards everywhere, because it's comfortable, and because nobody ever got fired for putting a mean on a slide.

Get comfortable being uncomfortable. Your users already are.

#latency #metrics #load-testing #slos
