The Blog
Long-form takes from the field — what I've learned building, breaking, and instrumenting production systems.
Averages hide the pain your users actually feel. Here's why every serious performance engineer reports in percentiles — and how to pick the right one.
When the system under test stalls, most load generators quietly stop sending new requests — which means the stalls never get recorded. Your p99 is a fantasy.
Teams strip user_id, request_id, and tenant_id out of their telemetry to save money, then wonder why they can't debug production. Cardinality is not the enemy — it's the point.
The word "chaos" has done more damage to the discipline than any outage ever did. If you can't write the hypothesis down, you're not doing chaos engineering — you're doing vandalism in a GameDay hat.
"Shift left" has become "run a full load test on every PR," which is why developers resent performance work. Real shift-left is three layers, not one.
Every alert should answer three questions at page-time: what is broken, what is the user impact, and what should I do next. If it can't, it's not an alert — it's a notification.
Availability isn't a scalar. A system that serves stale data during a database outage is more available than one that serves a 500 — even though both are "down."
These three patterns are taught separately, deployed separately, and fail together. They are one pattern in three clothes — here's the combined state machine and the defaults that actually protect production.
"Load test" has become a catch-all that conflates four different jobs with four different stakeholders. That's why most load-testing programs stall after six months.
Logs, metrics, and traces were a useful onboarding frame in 2017. In 2026 they're an active impediment — and the vendors selling them to you like it that way.