The Blog

Notes on performance, resiliency, and observability.

Long-form takes from the field — what I've learned building, breaking, and instrumenting production systems.

Beyond the Average: Why Percentiles Matter More Than Mean Response Times

Averages hide the pain your users actually feel. Here's why every serious performance engineer reports in percentiles — and how to pick the right one.

April 10, 2026 · 7 min read

Performance Engineering

The Coordinated Omission Problem: Why Your Load Test Is Lying to You

When the system under test stalls, most load generators quietly stop sending new requests — which means the stalls never get recorded. Your p99 is a fantasy.

February 16, 2026 · 10 min read

Observability

High-Cardinality Data: The Observability Superpower You're Probably Avoiding

Teams strip user_id, request_id, and tenant_id out of their telemetry to save money, then wonder why they can't debug production. Cardinality is not the enemy — it's the point.

December 15, 2025 · 9 min read

Resiliency Engineering

Chaos Engineering Is Not About Breaking Things

The word "chaos" has done more damage to the discipline than any outage ever did. If you can't write the hypothesis down, you're not doing chaos engineering — you're doing vandalism in a GameDay hat.

November 17, 2025 · 9 min read

Performance Engineering

Shifting Performance Left Without Slowing Down Delivery

"Shift left" has become "run a full load test on every PR," which is why developers resent performance work. Real shift-left is three layers, not one.

October 20, 2025 · 9 min read

Observability

From Alert Fatigue to Signal Clarity: Building Actionable Alerts

Every alert should answer three questions at page-time: what is broken, what is the user impact, and what should I do next. If it can't, it's not an alert — it's a notification.

September 22, 2025 · 8 min read

Resiliency Engineering

Graceful Degradation: Designing Systems That Bend Instead of Break

Availability isn't a scalar. A system that serves stale data during a database outage is more available than one that serves a 500 — even though both are "down."

August 18, 2025 · 9 min read

Resiliency Engineering

Timeouts, Retries, and Circuit Breakers: The Resilience Trinity

These three patterns are taught separately, deployed separately, and fail together. They are one pattern in three clothes — here's the combined state machine and the defaults that actually protect production.

July 21, 2025 · 10 min read

Performance Engineering

Load Testing Is Not Performance Testing: A Field Guide to the Difference

"Load test" has become a catch-all that conflates four different jobs with four different stakeholders. That's why most load-testing programs stall after six months.

June 16, 2025 · 9 min read

Observability

The Three Pillars Are a Lie: What Observability Actually Means

Logs, metrics, and traces were a useful onboarding frame in 2017. In 2026 they're an active impediment — and the vendors selling them to you like it that way.

May 12, 2025 · 8 min read