Architecture · 7 min read · August 5, 2025

Circuit Breaker Pattern: Building Resilient Services

When a downstream service fails, cascading retries can bring your entire system down. The circuit breaker pattern prevents this by failing fast.

James Ross Jr.

Strategic Systems Architect & Enterprise Software Developer

Cascading Failures Are the Real Danger

A single service going down is manageable. The real danger is when one service's failure takes out every service that depends on it, and then every service that depends on those, until the entire system is unresponsive.

Here is how it happens. Service A calls Service B. Service B is overloaded and responding slowly — not failing outright, just taking 30 seconds instead of 200 milliseconds. Service A's thread pool fills up with requests waiting on Service B. Service A stops responding to its own callers. Service C, which depends on Service A, fills up its thread pool waiting on A. The cascade propagates upstream until the user-facing application is completely unresponsive, even for features that have nothing to do with Service B.

The root cause is that Service A keeps trying to call Service B even though B is clearly in trouble. Each call ties up resources. The retry logic, designed to handle transient failures, makes things worse by multiplying the load on an already-struggling service.

The circuit breaker pattern interrupts this cascade by detecting when a downstream service is failing and stopping calls to it before they consume resources.


How the Circuit Breaker Works

The circuit breaker is a state machine with three states:

Closed is the normal operating state. Requests pass through to the downstream service. The circuit breaker monitors the results — tracking failure rates, timeouts, and error counts over a rolling window. As long as the failure rate stays below a configured threshold, the breaker remains closed.

Open is the failure state. When the failure rate exceeds the threshold — say, more than 50% of calls in the last 30 seconds have failed — the breaker trips open. All subsequent calls fail immediately without contacting the downstream service. Instead of waiting 30 seconds for a timeout, the caller gets an immediate failure response. This is the key behavior: failing fast preserves the caller's resources.

Half-open is the recovery probe state. After a configured wait period (maybe 60 seconds), the breaker allows a limited number of requests through to test whether the downstream service has recovered. If those probe requests succeed, the breaker closes and normal traffic resumes. If they fail, the breaker returns to the open state and the wait period resets.

The result is that a failing downstream service causes a brief period of errors (while the breaker detects the failure and trips open), followed by immediate failures that do not consume resources (while the breaker is open), followed by automatic recovery when the downstream service comes back.
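The state machine described above can be sketched in a few dozen lines. This is a minimal, illustrative version: it counts consecutive failures rather than tracking a rolling failure rate, is not thread-safe, and the class and parameter names are my own, not from any particular library.

```python
import time


class CircuitBreaker:
    """Minimal three-state circuit breaker: closed -> open -> half-open."""

    def __init__(self, failure_threshold=5, recovery_timeout=60.0):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.recovery_timeout = recovery_timeout    # seconds to wait before probing
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "half_open"  # allow a probe request through
            else:
                # Fail fast: no resources spent waiting on the downstream service.
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_failure(self):
        self.failures += 1
        # A failed probe in half-open, or too many failures in closed, opens the circuit.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = time.monotonic()

    def _record_success(self):
        self.failures = 0
        self.state = "closed"
```

Production-grade libraries (resilience4j on the JVM, Polly in .NET, pybreaker in Python) implement the same cycle with rolling windows and concurrency handling.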


Implementation Decisions

The circuit breaker concept is simple but the implementation details matter.

Failure threshold. How many failures trigger the breaker? Too sensitive and the breaker trips on normal transient errors. Too insensitive and the cascading failure has already started before the breaker reacts. A percentage-based threshold (50% failure rate) over a time window (last 30 seconds) with a minimum request count (at least 20 requests) works well for most cases. The minimum count prevents the breaker from tripping on a single failed request during low-traffic periods.
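The threshold logic above can be sketched as a small tracker. The class name and the `record`/`should_trip` interface are hypothetical; the point is the three knobs working together: window, rate, and minimum request count.

```python
import time
from collections import deque


class RollingWindowTracker:
    """Decides whether to trip based on failure rate over a sliding time window."""

    def __init__(self, window_seconds=30.0, failure_rate=0.5, min_requests=20):
        self.window = window_seconds
        self.failure_rate = failure_rate
        self.min_requests = min_requests
        self.outcomes = deque()  # (timestamp, succeeded) pairs

    def record(self, succeeded, now=None):
        now = time.monotonic() if now is None else now
        self.outcomes.append((now, succeeded))

    def should_trip(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop outcomes that have aged out of the window.
        while self.outcomes and now - self.outcomes[0][0] > self.window:
            self.outcomes.popleft()
        total = len(self.outcomes)
        if total < self.min_requests:
            return False  # too little traffic to judge; avoids tripping on one failure
        failures = sum(1 for _, ok in self.outcomes if not ok)
        return failures / total > self.failure_rate
```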

Timeout configuration. The circuit breaker should define what "failure" means. A timeout of 5 seconds when the service normally responds in 200 milliseconds is a failure, even if it eventually returns a 200 status. Slow responses that tie up resources are as dangerous as explicit errors.

Fallback behavior. When the breaker is open, what does the caller do? Options include returning cached data (stale but available), returning a default value, returning a degraded response (the page renders without recommendations), or surfacing the error to the user with a clear message. The right fallback depends on the feature. For non-critical data, a cached or default response is usually better than an error.
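A stale-but-available fallback might look like the following. All names here (`fetch`, `breaker_open`, the recommendations scenario) are hypothetical; the shape is what matters: serve the cache when the breaker is open, refresh the cache on successful calls, and degrade to an empty result rather than an error for non-critical data.

```python
def get_recommendations(user_id, fetch, breaker_open, cache):
    """Return recommendations, degrading gracefully when the breaker is open."""
    if breaker_open():
        # Stale but available: last cached result, or an empty list
        # so the page still renders without the recommendations panel.
        return cache.get(user_id, [])
    try:
        result = fetch(user_id)
        cache[user_id] = result  # refresh the fallback for next time
        return result
    except Exception:
        return cache.get(user_id, [])
```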

Monitoring. Circuit breaker state changes are important operational signals. When a breaker trips open, the operations team should know. When a breaker has been open for an extended period, something needs human attention. Publish breaker state changes as events or metrics and alert on them.

In a distributed system with many service-to-service calls, each call site should have its own circuit breaker instance. The payments service might be healthy while the inventory service is down. A single breaker for "all downstream calls" does not provide the granularity needed to maintain partial availability.
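Per-dependency breakers are typically managed through a small registry, so the payments breaker can be open while the inventory breaker stays closed. The registry and factory names below are illustrative:

```python
class BreakerRegistry:
    """One breaker instance per downstream dependency."""

    def __init__(self, factory):
        self.factory = factory    # callable that builds a fresh breaker
        self.breakers = {}

    def for_service(self, name):
        # Lazily create an independent breaker per call site / dependency.
        if name not in self.breakers:
            self.breakers[name] = self.factory()
        return self.breakers[name]
```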


Circuit Breakers in Context

The circuit breaker pattern works best as part of a broader resilience strategy. It pairs naturally with several other patterns:

Timeouts define when a slow response counts as a failure. Without proper timeouts, the circuit breaker's failure detection depends on the downstream service eventually returning an error, which might never happen if the connection hangs.

Retries with exponential backoff handle transient failures — the request that fails once but succeeds on the second attempt. The circuit breaker handles sustained failures. The two patterns complement each other: retry for brief glitches, break the circuit for persistent problems.
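Composing the two patterns means checking the breaker before every attempt, so retries stop the moment the circuit opens. The hook names (`breaker_allows`, `record`) are assumptions standing in for a real breaker's interface:

```python
import time


def call_with_retry(fn, breaker_allows, record, attempts=3, base_delay=0.1):
    """Retry with exponential backoff, gated by a circuit breaker.

    breaker_allows() -> bool and record(ok: bool) are hooks into a breaker.
    """
    for attempt in range(attempts):
        if not breaker_allows():
            # Persistent problem: stop retrying instead of piling on load.
            raise RuntimeError("circuit open: not retrying")
        try:
            result = fn()
        except Exception:
            record(False)
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...
        else:
            record(True)
            return result
```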

The bulkhead pattern isolates resources so that a failing downstream service only affects the calls to that service, not the entire application. Circuit breakers and bulkheads together provide both detection (circuit breaker) and containment (bulkhead).

Health checks provide an independent signal about downstream service health. A circuit breaker that considers health check results in addition to request failure rates can trip faster and recover more confidently.
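One hypothetical policy for combining the two signals: a failing health check lowers the failure-rate threshold, so the breaker trips before the full window of request errors accrues.

```python
def should_trip(failure_rate, threshold, health_check_passing):
    """Trip sooner when an independent health check already reports trouble."""
    effective = threshold if health_check_passing else threshold / 2
    return failure_rate > effective
```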

The goal is not perfect availability — that does not exist in distributed systems. The goal is graceful degradation: when a component fails, the system continues operating with reduced functionality rather than cascading into total failure. Circuit breakers are one of the most effective tools for achieving this.


If you are building services that depend on other services and want to design for resilience from the start, let's talk.

