Resiliency Engineering · 10 min read

Timeouts, Retries, and Circuit Breakers: The Resilience Trinity

These three patterns are taught separately, deployed separately, and fail together. They are one pattern wearing three disguises — here's the combined state machine and the defaults that actually protect production.

Every junior engineer can name the three. Timeouts. Retries. Circuit breakers. They show up in interview questions, design docs, and blog posts as a tidy little list of three distinct patterns you're supposed to sprinkle on your network calls like seasoning.

They are not three distinct patterns. They are one pattern, and treating them separately is how production outages happen.

The three failures, cleanly stated

Let's start with what each is actually for.

  • Timeout answers: "How long am I willing to wait before I decide this call failed?"
  • Retry answers: "If it failed, how many more times am I willing to try before I give up?"
  • Circuit breaker answers: "If failures keep happening across many calls, when do I stop calling entirely so I don't make things worse?"

Stated that way, you can see immediately that they're answers to the same question at different scopes: timeout is per call, retry is per request, circuit breaker is per dependency. They are a hierarchy, not a trio.

Why they fail when deployed separately

Here are the three combinations that hit production and ruin weekends.

Retry without timeout

You configure max_retries=3. A call hangs forever. You are not retrying three times. You are hanging forever on attempt one. A retry needs something to trigger it, and that something is usually a timeout.

If you've ever had a thread pool exhausted by calls that were "in-flight" for six hours, this is what happened. There was no upper bound on how long a call could linger, so nothing ever released the worker.
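Here's a minimal Go sketch of the fix, assuming an HTTP dependency; the URL, the 2-second per-attempt budget, and the bare three-attempt loop are illustrative. The point is that context.WithTimeout guarantees every attempt ends, which is the only reason the loop ever reaches attempt two.

package main

import (
    "context"
    "fmt"
    "net/http"
    "time"
)

// fetchOnce makes a single attempt, bounded by a per-attempt timeout.
// Without the deadline, a hung server keeps this worker occupied
// indefinitely and the retry loop never gets a chance to run.
func fetchOnce(ctx context.Context, url string) error {
    attemptCtx, cancel := context.WithTimeout(ctx, 2*time.Second) // per-call timeout
    defer cancel()

    req, err := http.NewRequestWithContext(attemptCtx, http.MethodGet, url, nil)
    if err != nil {
        return err
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err // includes context.DeadlineExceeded when the attempt hangs
    }
    defer resp.Body.Close()
    if resp.StatusCode >= 500 {
        return fmt.Errorf("server error: %d", resp.StatusCode)
    }
    return nil
}

func main() {
    // Three attempts only work because each attempt is guaranteed to end.
    for attempt := 1; attempt <= 3; attempt++ {
        err := fetchOnce(context.Background(), "https://example.com/health")
        if err == nil {
            fmt.Println("ok")
            return
        }
        fmt.Printf("attempt %d failed: %v\n", attempt, err)
    }
    fmt.Println("gave up after 3 attempts")
}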

Circuit breaker without timeout

The circuit breaker opens when failure rate crosses a threshold — say, 50% of the last 20 calls failed. But if calls hang instead of failing, they never register as failures. They just sit. The breaker never opens because, as far as it knows, no failures have happened yet. Meanwhile every worker in your pool is stuck waiting on a dead dependency.

Congratulations, your circuit breaker has been rendered decorative by a missing timeout.
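A stripped-down Go sketch of the mechanism, with the hanging dependency simulated: the breaker's window only ever sees what gets recorded as an outcome, and an outcome only exists because the per-call timeout converted the hang into an error.

package main

import (
    "context"
    "fmt"
    "time"
)

// slowDependency simulates a dependency that has stopped answering.
func slowDependency(ctx context.Context) error {
    select {
    case <-time.After(10 * time.Minute): // "hangs" far longer than anyone waits
        return nil
    case <-ctx.Done():
        return ctx.Err() // context.DeadlineExceeded: a failure the breaker can count
    }
}

func main() {
    var failures, total int
    for i := 0; i < 5; i++ {
        // Without this timeout, every iteration would block for 10 minutes
        // and the failure counters below would never move.
        ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
        err := slowDependency(ctx)
        cancel()
        total++
        if err != nil {
            failures++
        }
    }
    fmt.Printf("breaker window: %d/%d failures\n", failures, total)
    if float64(failures)/float64(total) >= 0.5 {
        fmt.Println("over the 50% threshold: the breaker can finally open")
    }
}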

Timeout without retry or breaker

The call times out cleanly. The caller gets an error. The user sees a 500 page for a transient glitch that would've worked on a second try. No retry means every transient blip becomes a user error. No breaker means there's no protection against a degrading dependency dragging you down with it.

Timeouts alone give you determinism. They don't give you resilience.

The combined state machine

Think of it as one state machine with three nested scopes.

per call:
  start timer (TIMEOUT)
  if success within timeout → return result
  if failure (error or timeout) → record failure

per request:
  on per-call failure:
    if attempts < MAX_RETRIES and error is retryable:
      wait (BACKOFF)
      retry per-call
    else:
      fail the request

per dependency:
  track success/failure rate
  if failure_rate > OPEN_THRESHOLD:
    open breaker → fail fast for COOLDOWN
  after COOLDOWN:
    half-open → allow one probe call
    if probe succeeds → close breaker
    if probe fails → re-open for COOLDOWN

Read from the bottom up: the circuit breaker is the outermost guard, deciding whether the request is even attempted. The retry loop sits inside, deciding whether a failure gets another shot. The timeout sits innermost, deciding when a single attempt is considered failed.

None of the three can do its job without the other two in place.
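Here's one way the three scopes collapse into code, as a Go sketch. The type names, thresholds, and the simple lock-around-counters breaker are illustrative rather than a production library (real ones use rolling windows and concurrency limits), but the nesting is the point.

package resilience

import (
    "context"
    "errors"
    "math/rand"
    "sync"
    "time"
)

// Illustrative tunables; adjust to your own budgets.
const (
    perCallTimeout = 300 * time.Millisecond
    maxRetries     = 2
    baseBackoff    = 50 * time.Millisecond
    openThreshold  = 0.5 // failure rate that opens the breaker
    minSamples     = 20  // don't judge a dependency on a handful of calls
    cooldown       = 30 * time.Second
)

var ErrBreakerOpen = errors.New("circuit breaker open")

const (
    closed = iota
    open
    halfOpen
)

// Breaker is the per-dependency scope: one instance per downstream service.
type Breaker struct {
    mu       sync.Mutex
    st       int
    failures int
    total    int
    openedAt time.Time
}

// allow is the outermost guard: may this attempt happen at all?
func (b *Breaker) allow() bool {
    b.mu.Lock()
    defer b.mu.Unlock()
    if b.st == open {
        if time.Since(b.openedAt) < cooldown {
            return false // fail fast, let the dependency breathe
        }
        b.st = halfOpen // cooldown elapsed: allow one probe
    }
    return true
}

// record feeds every attempt's outcome back into the breaker.
func (b *Breaker) record(err error) {
    b.mu.Lock()
    defer b.mu.Unlock()
    if b.st == halfOpen {
        if err == nil {
            b.st, b.failures, b.total = closed, 0, 0 // probe succeeded: close
        } else {
            b.st, b.openedAt = open, time.Now() // probe failed: re-open
        }
        return
    }
    b.total++
    if err != nil {
        b.failures++
    }
    if b.total >= minSamples && float64(b.failures)/float64(b.total) >= openThreshold {
        b.st, b.openedAt, b.failures, b.total = open, time.Now(), 0, 0
    }
}

// Do runs one request through all three scopes: breaker (per dependency),
// retry loop (per request), timeout (per call).
func (b *Breaker) Do(ctx context.Context, call func(context.Context) error) error {
    var err error
    for attempt := 0; attempt <= maxRetries; attempt++ {
        if !b.allow() {
            return ErrBreakerOpen
        }
        attemptCtx, cancel := context.WithTimeout(ctx, perCallTimeout)
        err = call(attemptCtx) // the per-call scope: this attempt must end
        cancel()
        b.record(err)
        if err == nil || attempt == maxRetries {
            break
        }
        // Exponential backoff with full jitter before the next attempt.
        sleep := time.Duration(rand.Int63n(int64(baseBackoff << attempt)))
        select {
        case <-time.After(sleep):
        case <-ctx.Done():
            return ctx.Err() // the overall request deadline is exhausted
        }
    }
    return err
}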

Defaults that actually protect production

Rather than abstract advice, here are defaults I've landed on over enough incidents to trust them. Adjust for your domain — but they're a sane starting point, not the unset zero values most client libraries ship with.

  • Connection timeout: 1-2 seconds. If you can't open a TCP connection in 2 seconds, the network or the dependency is sick. Retry is cheap here.
  • Read timeout: tied to the caller's SLO, not the callee's p99. If your API's p99 budget is 500ms and you're calling a downstream service, that call cannot be allowed to take more than ~300ms. Budgets propagate.
  • Overall request deadline: enforced at the entry point and propagated (gRPC deadlines, a deadline header passed along over HTTP, context deadlines in Go). Downstreams should see the remaining budget, not start their own stopwatch; see the sketch after this list.
  • Max retries: 2, maybe 3. If three retries aren't enough, retrying more won't help — the dependency is down.
  • Backoff: exponential with jitter. Without jitter you synchronize all clients to retry at the same moment and you've invented a thundering herd.
  • Retry budget: never more than ~10% of your traffic at any time. A fleet-wide retry storm is how a dependency's blip becomes your full outage.
  • Breaker threshold: 50% failure rate over a rolling window of 20-50 calls, plus a minimum-requests floor so a handful of failures in a low-traffic service don't trip the breaker immediately.
  • Breaker cooldown: 30 seconds to start with. Short enough to recover quickly, long enough to let the downstream breathe.
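Budget propagation is the item people most often skip, so here's a small Go sketch of it; the 500ms SLO, the 300ms per-call cap, and the helper name remainingBudget are hypothetical. The downstream call's timeout is derived from what's left of the request's deadline instead of being reset from scratch.

package main

import (
    "context"
    "fmt"
    "time"
)

// remainingBudget returns how much of the request's overall deadline is left,
// capped so a single downstream call can't spend the whole budget.
func remainingBudget(ctx context.Context, maxPerCall time.Duration) time.Duration {
    deadline, ok := ctx.Deadline()
    if !ok {
        return maxPerCall // nothing propagated; fall back to the per-call cap
    }
    if left := time.Until(deadline); left < maxPerCall {
        return left
    }
    return maxPerCall
}

func main() {
    // The entry point sets the overall request deadline once (a 500ms SLO here).
    ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
    defer cancel()

    // Pretend 250ms of the budget was already spent on earlier work.
    time.Sleep(250 * time.Millisecond)

    // The downstream call inherits the remaining budget, not a fresh stopwatch.
    budget := remainingBudget(ctx, 300*time.Millisecond)
    callCtx, cancelCall := context.WithTimeout(ctx, budget)
    defer cancelCall()

    fmt.Printf("budget handed to the downstream call: ~%v\n", budget.Round(10*time.Millisecond))
    _ = callCtx // pass callCtx into the downstream client here
}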

The retryable-vs-not question

A subtle but important detail: not every failure should trigger a retry. Retrying a POST /charge call that actually succeeded but timed out on the response can double-charge a customer. Retrying a 400 Bad Request does nothing useful — the request is broken, it'll be broken next time too.

The rule I use: retry on connection errors, timeouts, and 5xx responses, and only for idempotent requests. Do not retry on 4xx. Do not retry non-idempotent writes unless you have an idempotency key. The protocol is idempotency keys; the fallback is "don't retry this call at all."
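Here's roughly what that rule looks like as a predicate, sketched in Go; the function name and signature are mine, and a real client would usually separate the transport-error case from the HTTP-status case rather than pass both in.

package main

import (
    "context"
    "errors"
    "fmt"
    "net"
    "net/http"
)

// retryable reports whether a failed attempt is worth repeating.
// statusCode is zero when the failure happened below HTTP (connect, timeout).
func retryable(method string, statusCode int, err error, hasIdempotencyKey bool) bool {
    // Non-idempotent writes are only safe to retry with an idempotency key:
    // a POST /charge that timed out may still have succeeded server-side.
    idempotent := method == http.MethodGet || method == http.MethodHead ||
        method == http.MethodPut || method == http.MethodDelete || hasIdempotencyKey
    if !idempotent {
        return false
    }
    // Connection errors and timeouts: the request may never have arrived.
    var netErr net.Error
    if errors.Is(err, context.DeadlineExceeded) || errors.As(err, &netErr) {
        return true
    }
    // 5xx may be transient; 4xx means the request itself is broken and
    // will be just as broken on the next attempt.
    return statusCode >= 500
}

func main() {
    fmt.Println(retryable(http.MethodPost, 503, nil, false)) // false: no idempotency key
    fmt.Println(retryable(http.MethodGet, 503, nil, false))  // true: idempotent + 5xx
    fmt.Println(retryable(http.MethodGet, 400, nil, false))  // false: 4xx never retried
}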

Where this breaks down

The trinity is necessary, not sufficient. There are three failure modes it won't protect you from:

  • Correlated failures at higher layers. If region-level DNS is failing, circuit breakers at the service level aren't the right guard — you need cross-region failover.
  • Slow successes. A dependency that replies in 2 seconds with a 200 isn't triggering any of your failure detectors, but it's still dragging your p99 into the ground. You need latency-based breakers (eject slow replicas) for this.
  • Self-inflicted DoS. Retries with insufficient backoff, even with jitter, can turn a brief blip into a persistent overload. Token-bucket retry budgets are the fix.
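That last fix is simple enough to sketch. In this hypothetical Go version, ordinary requests slowly refill a bucket and every retry spends a whole token, so retries can't exceed roughly the deposit ratio (10% here) of live traffic for long.

package main

import (
    "fmt"
    "sync"
)

// RetryBudget is a token bucket: every ordinary request deposits a fraction
// of a token, every retry withdraws a whole one. An empty bucket means the
// failure is surfaced immediately instead of being retried.
type RetryBudget struct {
    mu     sync.Mutex
    tokens float64
    max    float64
    ratio  float64 // tokens deposited per ordinary request
}

func NewRetryBudget(max, ratio float64) *RetryBudget {
    return &RetryBudget{tokens: max, max: max, ratio: ratio}
}

// OnRequest is called once per first attempt.
func (b *RetryBudget) OnRequest() {
    b.mu.Lock()
    defer b.mu.Unlock()
    b.tokens += b.ratio
    if b.tokens > b.max {
        b.tokens = b.max
    }
}

// CanRetry withdraws a token if one is available.
func (b *RetryBudget) CanRetry() bool {
    b.mu.Lock()
    defer b.mu.Unlock()
    if b.tokens < 1 {
        return false
    }
    b.tokens--
    return true
}

func main() {
    budget := NewRetryBudget(10, 0.1) // retries capped near 10% of traffic
    for i := 0; i < 200; i++ {
        budget.OnRequest() // 200 ordinary requests keep the bucket topped up
    }
    allowed := 0
    for i := 0; i < 50; i++ { // a burst of failures asks for 50 retries
        if budget.CanRetry() {
            allowed++
        }
    }
    fmt.Printf("retries allowed during the burst: %d of 50\n", allowed)
}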

These are not reasons to skip the trinity. They're reasons to treat it as the floor, not the ceiling.

The one-sentence summary

If you remember one thing: never deploy a retry without the timeout that triggers it and the breaker that stops it. The three are always on together, or they're decorative.

#resilience #timeouts #retries #circuit-breaker
