Resiliency Engineering · 9 min read

Graceful Degradation: Designing Systems That Bend Instead of Break

Availability isn't a scalar. A system that serves stale data during a database outage is more available than one that serves a 500 — even though both are "down."

There's a question I like to ask in architecture reviews: "What happens when the database is down?" The honest answer, nine times out of ten, is "the site is down." That answer is treated as normal. It is not normal. It is a choice — one made quietly, by default, by every team that didn't think to ask.

Graceful degradation is the practice of asking the question on purpose. Of designing the "dependency is sick" path as carefully as the happy path. Of accepting that availability is not binary and acting like it.

Availability is not a scalar

We talk about availability like it's a single number — "five nines," "three nines," an uptime percentage on a status page. That framing only makes sense if the system is either fully up or fully down. Most real systems aren't.

Consider a product page. Fully working means: product info, inventory, reviews, recommendations, price, add-to-cart, all live. Now take the recommendations service away. Is the page down? Arguably not — a customer can still buy. Now take the inventory service away and replace it with "usually available." Is the page down? Still arguably not — most items really are usually available, and honesty about uncertainty beats a blank page.

Every one of those components going offline changes the page's character rather than flipping its availability like a switch. The system's availability is a spectrum, and the design job is to pick which degradations are acceptable and which are not.

Load-bearing vs. garnish

The first move in any degradation design is to classify your dependencies. I use a simple three-bucket split (a rough code sketch follows the list):

  • Load-bearing. Without this, the experience can't exist. For an e-commerce checkout: the payment processor, the order-creation service, the auth system. If these are down, you accept the feature is down and you make the error message dignified.
  • Structural. Removing this damages the experience but doesn't end it. Product metadata, inventory, session caching. There's usually a stale-but-valid version of this data somewhere — caches, replicas, CDN — and you design the system to reach for it when the live source is unreachable.
  • Garnish. Nice-to-haves that nobody misses when they're gone. Recommendations, trending badges, "23 people are looking at this item right now." If they're slow, skip them. If they're broken, render without them and log a warning.
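To make the classification operational, it helps to write it down next to the code rather than keep it in people's heads. Here's a minimal Python sketch with hypothetical names (Tier, DEPENDENCY_TIERS); the point is simply that every dependency carries an explicit tier that someone had to choose.

```python
from enum import Enum

class Tier(Enum):
    LOAD_BEARING = "load_bearing"   # without this, the feature is down
    STRUCTURAL = "structural"       # reach for a stale-but-valid copy
    GARNISH = "garnish"             # render without it, log a warning

# Hypothetical registry: every dependency gets an explicit tier,
# so "everything is load-bearing" has to be claimed out loud.
DEPENDENCY_TIERS = {
    "payment_processor": Tier.LOAD_BEARING,
    "order_service":     Tier.LOAD_BEARING,
    "auth":              Tier.LOAD_BEARING,
    "product_metadata":  Tier.STRUCTURAL,
    "inventory":         Tier.STRUCTURAL,
    "recommendations":   Tier.GARNISH,
    "review_widget":     Tier.GARNISH,
}
```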

The trap I've seen over and over is that every dependency gets treated as load-bearing, because nobody wants to sign the form saying it's garnish. The result is a system where a third-party review widget can take down the checkout page. That's not a review widget; that's a load-bearing wall in disguise.

Degradation patterns

A handful of patterns cover most of the work.

Stale-while-revalidate

Serve the cached version of the data while you try to refresh it in the background. If the refresh succeeds, the next request gets the fresh version; if not, the next request gets the same stale version you just served. This is the single highest-leverage degradation pattern because it's nearly invisible in the happy path and transformative in the bad one.

Works beautifully for: product metadata, pricing (with a ceiling on staleness), search results, anything where "slightly old" is dramatically better than "nothing."
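A minimal sketch of the idea, assuming a simple in-process cache and a caller-supplied fetch_live function; a real deployment would more likely lean on the stale-while-revalidate support in its CDN or cache layer.

```python
import threading
import time

_cache = {}        # key -> (value, fetched_at)
TTL_SECONDS = 60   # staleness ceiling before a background refresh is triggered

def get_with_swr(key, fetch_live):
    """Serve whatever we have immediately; refresh stale entries in the background."""
    entry = _cache.get(key)
    if entry is None:
        # Cold cache: fetch inline (or fall back to a bounded default).
        value = fetch_live(key)
        _cache[key] = (value, time.time())
        return value

    value, fetched_at = entry
    if time.time() - fetched_at > TTL_SECONDS:
        # Stale: serve it anyway, and try to refresh for the next request.
        threading.Thread(target=_refresh, args=(key, fetch_live), daemon=True).start()
    return value

def _refresh(key, fetch_live):
    try:
        _cache[key] = (fetch_live(key), time.time())
    except Exception:
        # Refresh failed: keep the stale entry; a later request will retry.
        pass
```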

Read-only mode

When writes are failing but reads aren't, don't hide the system. Serve reads. Put a banner up: "We're having trouble processing changes right now — you can still browse." Users who wanted to browse aren't blocked. Users who wanted to write know what's happening and can come back.

This is one of the rare patterns that's both a UX decision and an architectural one: your write path has to fail fast and loudly, and your read path has to be resilient to the write path's absence.
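A sketch of the shape this takes in code, with hypothetical handlers and in-memory stand-ins for the data stores; the essential part is that the read path never consults the write path's health, and the write path fails fast with a message the UI can render as the banner.

```python
class ReadOnlyMode(Exception):
    """Raised by the write path so the UI can show the banner instead of a 500."""

writes_enabled = True  # flipped by a health check or an operator when writes start failing

PRODUCTS = {"sku-123": {"name": "Kettle", "price": 39.00}}  # stand-in read store
ORDERS = []                                                  # stand-in write store

def handle_read(product_id):
    # Reads never consult the write path; they keep working regardless.
    return PRODUCTS.get(product_id)

def handle_write(order):
    if not writes_enabled:
        # Fail fast and loudly, with a message the UI can turn into the banner.
        raise ReadOnlyMode("We're having trouble processing changes right now. You can still browse.")
    ORDERS.append(order)
    return order
```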

Feature flags as degradation levers

Any non-trivial feature should have a switch that turns it off without a deploy. Not for feature-flag-driven-development — for incident response. When a recommendations service is melting the database at 3 AM, someone in the war room should be able to turn off recommendations in ten seconds.

The test: can your on-call engineer, at 3 AM, half-asleep, kill a non-essential feature with a single action? If not, you don't have a degradation lever. You have a code path.
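What "a single action" can look like, sketched against a hypothetical file-backed flag store (Redis or a config service would work the same way); the two properties that matter are that the check is cheap and that a broken flag store defaults to normal behavior instead of taking the page down.

```python
import json
import pathlib

# Hypothetical flag store: a file here, but Redis, Consul, or a config service works the same way.
FLAG_FILE = pathlib.Path("/etc/myapp/flags.json")

def flag_enabled(name, default=True):
    """Cheap runtime check; a broken flag store must never take the page down."""
    try:
        flags = json.loads(FLAG_FILE.read_text())
        return bool(flags.get(name, default))
    except Exception:
        return default

def fetch_recommendations(product_id):
    """Stand-in for the call to the recommendations service."""
    return ["sku-456", "sku-789"]

def render_product_page(product_id):
    page = {"product": product_id}
    if flag_enabled("recommendations"):
        try:
            page["recommendations"] = fetch_recommendations(product_id)
        except Exception:
            pass  # garnish: render without it and move on
    return page
```

Killing recommendations at 3 AM then means writing {"recommendations": false} to the flag store, not cutting a release.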

Bounded defaults

When a dependency is unreachable, what do you show in its place? The wrong answer is "error." The right answer is frequently a reasonable default that's explicitly flagged as degraded.

Shipping estimate service down? Show "ships in 3-5 business days" — the domain average — with a small note that this is an estimate. Inventory service down? Show "usually available" with a disclaimer. A customer who sees a helpful approximation is better served than one who sees a blank space.
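A sketch of the shipping-estimate fallback, with a hypothetical client object; the important detail is that the response carries a degraded flag so the UI knows to show the small "this is an estimate" note.

```python
# Domain-average fallback, explicitly flagged so the UI can add the disclaimer.
DEFAULT_SHIPPING = {"estimate": "Ships in 3-5 business days", "degraded": True}

def shipping_estimate(sku, client, timeout=0.3):
    """Ask the live service, but never let its failure become the page's failure."""
    try:
        live = client.estimate(sku, timeout=timeout)  # hypothetical client call
        return {"estimate": live, "degraded": False}
    except Exception:
        return DEFAULT_SHIPPING
```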

The decision isn't technical, it's product

Here's where graceful degradation gets interesting. The question "what should we show when inventory is down?" sounds like an engineering question. It isn't. It's a product and legal question. Can we show "usually available" and risk an unhappy customer when a backorder turns out to be a four-week wait? Is the reduced trust worth the completed purchases we saved? How do we message the uncertainty?

Engineers shouldn't make that call alone. Product and legal should be in the room when you design the degraded experience, because the degraded experience is a product. Not a hack. Not a fallback. A product.

The teams that get this right produce a document — sometimes called a "degradation matrix" — listing every dependency, what happens when it's down, and what the user sees. The document is signed by engineering, product, and the business. When the dependency eventually goes down, nobody is making it up on the fly.
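For concreteness, a couple of rows of such a matrix, sketched as plain data with illustrative field names; the format matters far less than the fact that every cell was decided in advance and has an owner.

```python
DEGRADATION_MATRIX = [
    {
        "dependency": "inventory",
        "down_behavior": "serve cached stock level, else 'usually available'",
        "user_message": "Availability shown is an estimate",
        "decision_owner": "product + legal",
    },
    {
        "dependency": "recommendations",
        "down_behavior": "omit the module entirely",
        "user_message": "none",
        "decision_owner": "engineering",
    },
]
```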

Testing degradation

A degradation path that hasn't been tested isn't a degradation path. It's a hope.

Every degradation lever I've put into a system has, at some point, silently broken. Code got refactored, the fallback stopped being called, and nobody noticed until the day the dependency actually went down — at which point the "degradation" was a 500. The lesson: run the degraded path continuously. Either ship a small percentage of live traffic through the fallback on purpose, or run a scheduled chaos experiment that disables the dependency and watches what happens.
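A sketch of the continuous check, written as a test that injects the outage and asserts the fallback actually fires; it reuses the hypothetical shipping_estimate function from the bounded-defaults sketch above.

```python
import unittest
from unittest import mock

class ShippingDegradationTest(unittest.TestCase):
    def test_estimate_degrades_when_service_is_down(self):
        broken_client = mock.Mock()
        broken_client.estimate.side_effect = TimeoutError("injected outage")

        result = shipping_estimate("sku-123", client=broken_client)

        # The fallback must actually fire: not a 500, not None.
        self.assertTrue(result["degraded"])
        self.assertIn("3-5 business days", result["estimate"])

if __name__ == "__main__":
    unittest.main()
```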

If you're not comfortable running that experiment in production, you don't actually trust your degradation design. Find out now, not at 3 AM.

What to do this quarter

  1. Draw the degradation matrix for your top user-facing flow. One row per dependency. Columns: "down behavior," "user-visible message," "owner of that decision." The act of filling it in will reveal more than any architecture review.
  2. Find one load-bearing-in-disguise dependency and move it to garnish. Usually there's a scoreboard widget, a social share button, a third-party tracking pixel — something that can silently take your page down and doesn't deserve to.
  3. Add one feature flag to disable a non-essential feature in under 30 seconds. Document it in the runbook. Test it in staging.

The goal isn't that nothing ever breaks. The goal is that when things break, the system bends — and the users barely notice. That's the difference between an outage and a blip. And it is almost entirely a product of what you decided, on purpose, before anything went wrong.

#resilience #availability #architecture #ux
