
One of my colleagues asked me what the difference between Circuit Breaker and Retry is, but I was not able to answer him correctly. All I know is that a circuit breaker is useful if there is a heavy request payload, but this can also be achieved using retry. So when should I use Circuit Breaker and when Retry?

Also, is it possible to use both on the same API?

Peter Csala

2 Answers


Several years ago I wrote a resilience catalog to describe different mechanisms. I originally created this document for co-workers and later shared it publicly. Please allow me to quote the relevant parts here.

Retry

Categories: reactive, after the fact

The relation between retries and attempts: n retries means at most n+1 attempts. The +1 is the initial request; if it fails (for whatever reason), then the retry logic kicks in. In other words, the 0th step is executed with 0 delay penalty.
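To make the counting concrete, here is a minimal Python sketch. The helper name `call_with_retries` and its parameters are made up for illustration and do not come from any particular library; the `sleep` argument is injectable only so the delay can be skipped in tests.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(operation: Callable[[], T], max_retries: int,
                      delay_seconds: float = 0.1,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `operation`, retrying up to `max_retries` times on any exception."""
    attempt = 0  # attempt 0 is the initial request and runs with no delay
    while True:
        try:
            return operation()
        except Exception:
            if attempt == max_retries:
                raise  # all n + 1 attempts failed, so give up
            attempt += 1
            sleep(delay_seconds)  # the delay penalty applies only to retries
```

With `max_retries=3`, an operation that always fails is invoked 4 times and the helper sleeps 3 times before re-raising the last error.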

There are situations where your requested operation relies on a resource that might not be reachable at a certain point in time. In other words, there can be a temporary issue, which will be gone sooner or later. This sort of issue can cause transient failures. With retries you can overcome these problems by attempting to redo the same operation at a specific moment in the future. To be able to use this mechanism, the following criteria should be met:

  • The potentially introduced observable impact is acceptable
  • The operation can be redone without any irreversible side effect
  • The introduced complexity is negligible compared to the promised reliability

Let’s review them one by one:

  • The word failure indicates that the effect is observable by the requester as well, for example via higher latency, reduced throughput, etc. If the "penalty" (delay or reduced performance) is unacceptable, then retry is not an option for you.
  • This requirement is also known as idempotency. If I call the action with the same input several times, then it will produce the exact same result. In other words, the operation acts as if it depends only on its parameters and nothing else influences the result (like other objects' state).
  • Even though this condition is one of the most crucial, it is the one that is almost always forgotten. As always, there are trade-offs (if I introduce Z then it will increase X but it might decrease Y). We should be fully aware of them, otherwise they will give us unwanted surprises at the least expected time.

Circuit Breaker

Categories: proactive, before the fact

It is hard to categorize the circuit breaker because it is proactive and reactive at the same time. It detects that a given downstream system is malfunctioning (reactive) and it protects the downstream system from being flooded with new requests (proactive).

This is one of the most complex patterns, mainly because it uses different states to define different behaviours. Before we jump into the details, let's see why this tool exists at all:

Circuit breaker detects failures and prevents the application from trying to perform the action that is doomed to fail (until it is safe to retry) - Wikipedia

So, this tool works as a mini data and control plane. The requests go through this proxy, which examines the responses (if any) and counts consecutive failures. If a predefined threshold is reached, then the transfer is suspended temporarily and every request fails immediately.

  • Why is it useful?

It prevents cascading failures. In other words, the transient failure of a downstream system should not be propagated to the upstream systems. By containing the failure we also prevent a chain reaction (domino effect).

  • How does it know when a transient failure is gone?

It must somehow determine when it would be safe to operate as a proxy again. For example, it can use the same detection mechanism that was used during the original failure detection. So, it works like this: after a given period of time it allows a single request to go through and examines the response. If it succeeds, then the downstream is treated as healthy. Otherwise nothing changes (no request is transferred through this proxy); only the timer is reset.

  • What states does it use?

The circuit breaker can be in any of the following states: Closed, Open, HalfOpen.

  • Closed: It allows any request. It counts successive failed requests.
    • If the successive failed count is below the threshold and the next request succeeds then the counter is set back to 0.
    • If the predefined threshold is reached then it transitions into Open
  • Open: It rejects any request immediately. It waits a predefined amount of time.
    • Once that time has elapsed, it transitions into HalfOpen
  • HalfOpen: It allows only one request. It examines the response of that request:
    • If the response indicates success then it transitions into Closed
    • If the response indicates failure then it transitions back to Open
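The state transitions above can be sketched in a few lines of Python. This is a simplified, single-threaded illustration under my own assumptions, not the implementation of any real library: the names `CircuitBreaker`, `BrokenCircuitError`, `failure_threshold` and `break_seconds` are made up, and the Open to HalfOpen transition is checked lazily on the next call instead of by a background timer.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class BrokenCircuitError(Exception):
    """Raised when a call is rejected because the circuit is Open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int, break_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.break_seconds = break_seconds
        self.clock = clock
        self.state = "Closed"
        self.failures = 0     # successive failures counted so far
        self.opened_at = 0.0  # clock value when the breaker last tripped

    def call(self, operation: Callable[[], T]) -> T:
        if self.state == "Open":
            if self.clock() - self.opened_at < self.break_seconds:
                raise BrokenCircuitError("circuit is open")  # fail fast
            self.state = "HalfOpen"  # break elapsed: let one probe through
        try:
            result = operation()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self) -> None:
        # A success resets the counter; a successful probe closes the circuit.
        self.failures = 0
        self.state = "Closed"

    def _on_failure(self) -> None:
        self.failures += 1
        # A failed probe, or reaching the threshold while Closed, opens it.
        if self.state == "HalfOpen" or self.failures >= self.failure_threshold:
            self.state = "Open"
            self.opened_at = self.clock()
```

A production-grade breaker would also need locking so that only one concurrent request is admitted in HalfOpen; this sketch gets that for free only because it is single-threaded.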

Resiliency strategy

The above two mechanisms / policies are not mutually exclusive; on the contrary, they can be combined via the escalation mechanism. If the inner policy can't handle the problem, it can propagate it one level up to an outer policy.

When you try to perform a request while the Circuit Breaker is Open then it will throw an exception. Your retry policy could trigger for that and adjust its sleep duration (to avoid unnecessary attempts).
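As a rough Python sketch of that escalation: the retry is the outer policy and reacts differently to the breaker's exception than to an ordinary failure. Everything here is hypothetical, in particular the `BrokenCircuitError` type and its `retry_after` attribute (the remaining break time in seconds), which I assume the breaker exposes.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class BrokenCircuitError(Exception):
    """Hypothetical error raised by an Open circuit breaker."""

    def __init__(self, retry_after: float) -> None:
        super().__init__(f"circuit open, retry in {retry_after:.1f}s")
        self.retry_after = retry_after  # seconds until a probe is allowed


def retry_through_breaker(guarded_call: Callable[[], T], max_retries: int,
                          base_delay: float = 0.1,
                          sleep: Callable[[float], None] = time.sleep) -> T:
    """Outer retry policy around a call that a breaker already guards."""
    attempt = 0
    while True:
        try:
            return guarded_call()
        except BrokenCircuitError as error:
            if attempt == max_retries:
                raise
            # Retrying sooner would only fail fast again, so wait out the break.
            delay = max(base_delay, error.retry_after)
        except ConnectionError:
            if attempt == max_retries:
                raise
            delay = base_delay  # ordinary transient failure
        attempt += 1
        sleep(delay)
```

The design choice is simply that the retry sleeps at least as long as the breaker stays Open, which avoids burning attempts on calls that are guaranteed to be rejected.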

The downstream system can also inform upstream that it is receiving too many requests by responding with a 429 status code. The Circuit Breaker could also trigger for this and use the Retry-After header's value for its sleep duration.
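A possible way to derive the break duration from such a response, sketched in Python with only the standard library. The function name, the 30-second fallback and the plain `headers` mapping are assumptions for illustration. Retry-After may carry either a number of seconds or an HTTP date, so both forms are handled.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def break_duration_from_response(status_code: int, headers: Mapping[str, str],
                                 default_seconds: float = 30.0,
                                 now: Optional[datetime] = None) -> Optional[float]:
    """Return how long to keep the circuit open, or None if no break is needed."""
    if status_code != 429:
        return None
    # Assumes the mapping uses this exact key; plain dicts are case-sensitive.
    value = headers.get("Retry-After")
    if value is None:
        return default_seconds
    value = value.strip()
    if value.isdigit():
        return float(value)  # delay-seconds form, e.g. "120"
    try:
        retry_at = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default_seconds  # unparseable header: fall back
    if retry_at.tzinfo is None:
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (retry_at - now).total_seconds())
```

For example, a 429 response with `Retry-After: 120` yields a break of 120 seconds, while any non-429 status yields `None`.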

So, the whole point of this section is that you can define a protocol between client and server for overcoming transient failures together.

Peter Csala

The Retry pattern enables an application to retry an operation in hopes of success.

The Circuit Breaker pattern prevents an application from performing an operation that is likely to fail.

Retry - The Retry pattern is useful in scenarios with transient failures. What does this mean? Failures that are "temporary", lasting only for a short amount of time, are transient. A momentary loss of network connectivity, a brief moment when the service goes down or is unresponsive, and the related timeouts are examples of transient failures.

As the failure is transient, retrying after some time could possibly give us the result we need.

Circuit Breaker - The Circuit Breaker pattern is useful in scenarios with long-lasting faults. Consider a loss of connectivity or the failure of a service that takes some time to repair itself. In such cases, it may not be of much use to keep retrying often if it is indeed going to take a while to hear back from the server. The Circuit Breaker pattern is meant to prevent an application from performing an operation that is likely to fail.

The Circuit Breaker keeps tabs on the number of recent failures and, on the basis of a predetermined threshold, determines whether the request should be sent to the server under stress or not.

  • So, when will a circuit breaker make a call to the server? I mean, how will it know whether the server is ready to serve again? And is it possible to use a circuit breaker along with retry? – Satyaprakash Nayak Dec 21 '22 at 12:23
  • Closed: When everything is normal, the circuit breaker remains in the closed state and all calls pass through to the services. When the number of failures exceeds a predetermined threshold, the breaker trips and opens. Open: The circuit breaker returns an error for calls without executing the function. Half-Open: After a timeout period, the circuit switches to a half-open state to test whether the underlying problem still exists. If a single call fails in this half-open state, the breaker is tripped once again. If it succeeds, the circuit breaker resets back to the normal closed state. – Abhishek Mahajan Dec 22 '22 at 06:51
  • Hi Abhishek, sounds good to me. Can you put this in your answer as well? Also, please answer my last query: is it possible to use a circuit breaker along with retry? – Satyaprakash Nayak Dec 22 '22 at 11:55
  • It's definitely possible to have retries that go through the circuit breaker, though it's worth noting that when the breaker trips, the retries will fail fast. So it will probably make sense to have the retries back off exponentially (e.g. retry after 100 ms the first time, then 200 ms the second time, 400 ms, 800 ms, 1.6 s, etc., ignoring the jitter that a good implementation of exponential backoff will probably introduce). – Levi Ramsey Dec 23 '22 at 16:53
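The backoff schedule from the last comment can be sketched in Python as follows. The function name and parameters are made up for illustration, and the additive uniform jitter is just one simple choice among several jitter strategies.

```python
import random
from typing import List, Optional


def backoff_delays(retries: int, base_seconds: float = 0.1,
                   factor: float = 2.0, max_jitter_seconds: float = 0.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return the sleep duration before each retry, doubling every time."""
    rng = rng or random.Random()
    delays = []
    for retry in range(retries):
        delay = base_seconds * factor ** retry  # 0.1, 0.2, 0.4, ...
        # Random jitter spreads out clients that failed at the same moment.
        jitter = rng.uniform(0.0, max_jitter_seconds)
        delays.append(delay + jitter)
    return delays
```

With the defaults (no jitter), `backoff_delays(5)` returns `[0.1, 0.2, 0.4, 0.8, 1.6]`, matching the 100 ms to 1.6 s sequence above.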