From Alert Fatigue to Signal Clarity: Building Actionable Alerts
Every alert should answer three questions at page-time: what is broken, what is the user impact, and what should I do next. If it can't, it's not an alert — it's a notification.
I once joined a team where the on-call rotation received an average of 47 pages per week. When I asked the on-call engineer what was on fire, she said "nothing, really." She wasn't being flip. Of those 47 pages, maybe three needed action. The rest were... furniture.
That team didn't have an alerting system. It had a notification system pretending to be one. And the people on call had, very sensibly, stopped taking any of it seriously.
The three-question test
A useful test for any alert: if this page wakes me up at 2 AM, can I answer these three questions from the page itself?
- What is broken? Not the symptom — the actual thing. "Checkout error rate is 8%" is better than "Elevated HTTP 500s." "Database connection pool saturated" is better than "pg_stat anomaly."
- What is the user impact? "Roughly 120 users per minute cannot complete checkout." Without this, I cannot tell if this is a five-minute problem or a five-hour problem.
- What should I do next? A link to a runbook, or at least a sentence. "Check the connection pool dashboard, consider bouncing pgbouncer, escalate to DBA if unchanged after 5 minutes." No runbook link means whoever's on-call now has to invent a process while bleeding.
If your alert can't answer all three, it has no business paging a human. Demote it to a dashboard, a ticket, a Slack message — anything except a phone vibration at 2 AM.
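To make that concrete, here is a sketch of a page payload that carries all three answers. The field names, the numbers, and the runbook URL are illustrative placeholders, not the schema of any particular alerting tool:

```python
# A sketch of a page payload that answers all three questions up front.
# Every name, number, and URL here is an illustrative placeholder.
checkout_fast_burn_page = {
    "alert": "CheckoutErrorBudgetFastBurn",
    # What is broken: the actual thing, not just the raw symptom.
    "what": "Checkout error rate is 8% against a 0.1% SLO budget",
    # What is the user impact: lets the responder size the problem immediately.
    "impact": "Roughly 120 users per minute cannot complete checkout",
    # What to do next: a runbook link plus a one-sentence first move.
    "runbook": "https://runbooks.example.internal/checkout-error-burn",
    "next": "Check the connection pool dashboard; consider bouncing pgbouncer; "
            "escalate to DBA if unchanged after 5 minutes",
}
```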
The structural fix: SLO-based alerting
Most alert fatigue starts with threshold-based alerts. "CPU > 80%." "Response time > 500ms." "Error rate > 1%." These fire constantly because the thresholds are arbitrary and the signals are transient. Nobody wants to be woken up for a 30-second CPU spike that autoscaling handled cleanly.
SLO burn-rate alerts are the structural fix. Instead of "this metric is high right now," they ask "are we burning through our error budget faster than we can afford to?"
Say your SLO is "99.9% of checkout requests succeed, measured over 30 days." That gives you an error budget of 0.1% — about 43 minutes of full downtime per month. A burn-rate alert says: "over the last hour, we're consuming error budget fast enough that we'd exhaust a month's budget in six hours." That's worth paging for. A single elevated error-rate blip that recovers isn't.
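The arithmetic is worth doing once by hand. A quick sketch, with a 12% observed error rate chosen only because it produces the six-hour figure above:

```python
# Error budget for "99.9% of checkout requests succeed, measured over 30 days".
slo_target = 0.999
window_minutes = 30 * 24 * 60                    # 43,200 minutes in the window
error_budget = 1 - slo_target                    # 0.1% of requests may fail
budget_minutes = window_minutes * error_budget   # ~43.2 minutes of full downtime

# Burn rate = observed error rate / allowed error rate.
# A 12% checkout error rate against a 0.1% budget burns ~120x too fast.
burn_rate = 0.12 / error_budget                       # ~120
hours_to_exhaust = (window_minutes / 60) / burn_rate  # 720 / 120 = ~6 hours
```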
The Google SRE book describes multi-window, multi-burn-rate alerts — usually two alerts per SLO, one fast (short window, high burn rate, catches acute incidents) and one slow (long window, lower burn rate, catches chronic erosion). That pair replaces maybe a dozen threshold alerts and cuts the noise by an order of magnitude.
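A minimal sketch of that pair, assuming a helper that returns the trailing error rate for an arbitrary window. The helper is a placeholder for whatever your metrics backend exposes, and the thresholds are the commonly cited starting points of 14.4x over one hour and 6x over six hours, each confirmed on a shorter window to avoid flapping:

```python
ERROR_BUDGET = 0.001  # 99.9% SLO

def error_rate(window_minutes: int) -> float:
    """Fraction of failed requests over the trailing window.
    Placeholder: in practice this queries your metrics backend."""
    raise NotImplementedError

def burn_rate(window_minutes: int) -> float:
    return error_rate(window_minutes) / ERROR_BUDGET

def should_page() -> bool:
    # Fast burn: acute incident. Roughly 2% of a 30-day budget in one hour.
    fast = burn_rate(60) >= 14.4 and burn_rate(5) >= 14.4
    # Slow burn: chronic erosion. Roughly 5% of the budget in six hours.
    slow = burn_rate(6 * 60) >= 6 and burn_rate(30) >= 6
    return fast or slow
```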
Put the runbook in the page
The single cheapest upgrade to any alert is adding a runbook link. Not a wiki page that references the runbook. Not a Confluence dashboard. The runbook itself, clickable from the page on your phone.
A good runbook is short — a one-pager. It contains:
- What this alert means, in one sentence.
- The top two or three things to check, with direct links to the relevant dashboards or queries.
- Common causes, ranked by frequency.
- Remediation steps, including any "nuclear option" with a clear sign-off requirement.
- Escalation: who to call if you're stuck.
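To make the shape concrete, here is a skeleton covering those five pieces. Every name, cause, and link in it is a placeholder:

```
Alert: CheckoutErrorBudgetFastBurn
Meaning: checkout requests are failing fast enough to exhaust the monthly SLO budget in hours.

Check first:
  1. Checkout success-rate dashboard         <link>
  2. Connection pool saturation panel        <link>
  3. Recent deploys to the checkout service  <link>

Common causes (most frequent first): bad deploy, connection pool exhaustion, payment provider outage.

Remediation: roll back the most recent deploy; bounce pgbouncer if the pool is saturated.
  Nuclear option (requires incident commander sign-off): put checkout into maintenance mode.

Escalation: checkout on-call channel first, then the DBA on-call if it looks database-side.
```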
Runbooks rot if you don't maintain them. The rule I apply: every time an on-call engineer answers a page, they update the runbook if they did something the runbook didn't predict. That's a small tax that keeps the runbook alive.
The monthly cull
Here's a practice that separates teams with healthy alerting from teams that drown in it: every month, look at every alert that fired, and ask:
- Did a human take action? If no — demote or delete.
- Was the action the same every time? If yes — automate it.
- Did it fire during a known-good state? If yes — the threshold is wrong; tighten or replace with an SLO burn.
That three-question audit, done monthly, naturally drives the alert list toward "short and angry" instead of "long and boring." Short and angry is what you want — when a page fires, the on-call should believe it.
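A sketch of that audit as a script, assuming you can export a month of pages with a few fields the responder fills in after each one (the record shape is invented for illustration):

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class PageRecord:
    alert_name: str
    human_took_action: bool        # filled in by the responder after the page
    action_taken: str              # "" if no action was taken
    fired_during_known_good: bool  # True if the system was actually healthy

def cull_recommendations(pages: list[PageRecord]) -> dict[str, str]:
    """Apply the three audit questions to a month of pages, per alert."""
    by_alert: dict[str, list[PageRecord]] = {}
    for page in pages:
        by_alert.setdefault(page.alert_name, []).append(page)

    verdicts: dict[str, str] = {}
    for name, fired in by_alert.items():
        actions = Counter(p.action_taken for p in fired if p.human_took_action)
        if not actions:
            verdicts[name] = "demote or delete: no human ever acted on it"
        elif len(actions) == 1:
            verdicts[name] = "automate: the response was identical every time"
        elif any(p.fired_during_known_good for p in fired):
            verdicts[name] = "retune: fired while healthy; consider an SLO burn-rate alert instead"
        else:
            verdicts[name] = "keep"
    return verdicts
```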
Symptoms, not causes
A small but important principle: alert on user-visible symptoms, not internal causes. "Checkout success rate dropped" is the right alert. "Replica lag exceeded 5 seconds" is usually the wrong one, not because replica lag doesn't matter, but because it's a cause that sometimes produces no user-visible symptom at all, and sometimes produces one so minor that no response is needed.
The exception: causes worth alerting on are the ones that will produce a user-visible symptom if left unaddressed. Disk filling up at the current rate, certificates approaching expiry, quota usage trending toward its limit. These are predictive, and they belong in a different alerting tier than acute user pain.
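The "disk filling up at the current rate" case is plain linear extrapolation. A sketch, with the byte counts standing in for whatever your metrics system reports:

```python
def hours_until_disk_full(used_now_bytes: float, used_6h_ago_bytes: float,
                          capacity_bytes: float) -> float | None:
    """Extrapolate time-to-full from the last six hours of growth.
    Returns None if usage is flat or shrinking."""
    growth_per_hour = (used_now_bytes - used_6h_ago_bytes) / 6
    if growth_per_hour <= 0:
        return None
    return (capacity_bytes - used_now_bytes) / growth_per_hour

# Predictive, so it routes to the ticket tier, not the pager.
remaining = hours_until_disk_full(700e9, 650e9, 1e12)
if remaining is not None and remaining < 72:
    print(f"TICKET: disk projected to fill in roughly {remaining:.0f} hours")
```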
Severity tiers that mean something
Most alerting systems ship with three to five severity levels that all mean the same thing in practice: "somebody should look at this eventually." Be strict.
- Page (wakes you up). User-visible impact happening right now. SLO burn at a rate that would exhaust the month's budget in hours. You drop everything.
- Ticket (business-hours). Something is degrading. Not bleeding. Look at it during the day.
- Log/dashboard (no routing). Informational. Only looked at when someone is already investigating.
Three tiers. Everything fits. The moment you have "P2 that kind of pages" or "P3 that sometimes pages," you've rotted the hierarchy and the on-call can't predict what's urgent.
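A sketch of the three tiers as the only routing decision the system ever makes, just to show how little room there is for in-between cases (the print calls stand in for your paging and ticketing integrations):

```python
from enum import Enum

class Severity(Enum):
    PAGE = "page"      # user-visible impact right now, or a fast SLO burn
    TICKET = "ticket"  # degrading, not bleeding; handled during the day
    LOG = "log"        # informational; no routing, no interruption

def route(severity: Severity, message: str) -> None:
    """Exactly three destinations; nothing 'kind of pages'."""
    if severity is Severity.PAGE:
        print(f"PAGE the on-call now: {message}")       # wire to your paging tool
    elif severity is Severity.TICKET:
        print(f"TICKET for business hours: {message}")  # wire to your ticketing tool
    else:
        pass  # stays on a dashboard or in logs; nobody is interrupted
```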
What to do this quarter
- Audit a week of pages. For each, mark whether it required action. If fewer than 50% did, your alerts have lost credibility; overhaul is cheaper than patching.
- Replace your top three noisy threshold alerts with SLO burn-rate alerts. Pick the three by raw page count. The noise drops dramatically and the signal sharpens.
- Write a one-page runbook for every alert that pages, and put the link directly in the page template. If you can't write a runbook for an alert, that alert should not exist.
The goal isn't an empty pager. It's a pager you trust. A short, angry list of pages that always mean something is the shape of a healthy on-call rotation — and it's the most humane thing you can do for the people carrying one.