34

I'm trying to make a few microservices more resilient and retrying certain types of HTTP requests would help with that.

Retrying timeouts will give clients a terribly slow experience, so I don't intend to retry in this case. Retrying 400s doesn't help because a bad request will remain a bad request a few milliseconds later.

I imagine there are other reasons to not retry a few other types of errors, but which errors and why?

youngrrrr
cahen
  • "Retrying timeouts will give clients a terribly slow experience" - what do you mean? – Constantin Galbenu Dec 07 '17 at 07:13
  • if the timeout threshold is set to 5 seconds and you retry twice, the client could be waiting for 15 seconds in total to only get an error in the end – cahen Dec 07 '17 at 09:59
  • I would suggest you consider exponential backoff with jitter for your retries, where the initial delay is sub-second. That way initial retries can happen quickly but the system won't ever get flooded because of the backoff. Ideally, use a library like Spring Retry or Failsafe that has a nice API and implements the backoff strategies for you. – Sam Mefford Apr 28 '20 at 15:49
  • 1
    If someone has a reverse question, see [What are the http codes to automatically retry the request?](https://stackoverflow.com/questions/51770071/what-are-the-http-codes-to-automatically-retry-the-request/74627395#74627395) – Michael Freidgeim Mar 03 '23 at 21:55
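
The arithmetic from the comments above (a 5-second timeout with two retries adding up to 15 seconds) and the suggested exponential backoff with jitter can be sketched roughly as follows. This is a minimal illustration, not the API of Spring Retry or Failsafe; the function names and the `base`, `cap`, and `max_retries` values are assumptions chosen for the example.

```python
import random


def backoff_delays(max_retries, base=0.2, cap=5.0, rng=random.random):
    """Yield one sleep duration in seconds per retry, using "full jitter".

    base, cap and max_retries are illustrative values, not recommendations.
    """
    for attempt in range(max_retries):
        # Exponential ceiling: base, 2*base, 4*base, ... limited by cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick a uniformly random delay below the ceiling.
        yield rng() * ceiling


def worst_case_wait(timeout, max_retries, base=0.2, cap=5.0):
    """Upper bound on the total client wait if every attempt times out."""
    attempts = max_retries + 1  # the first try plus the retries
    max_sleep = sum(min(cap, base * (2 ** i)) for i in range(max_retries))
    return attempts * timeout + max_sleep
```

With no sleep between attempts, `worst_case_wait(5, 2, base=0, cap=0)` reproduces the 15 seconds mentioned above; a sub-second `base` adds little on top of that, so the timeout itself dominates the wait.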

2 Answers

43

There are some errors that should not be retried because they seem permanent:

  • 400 Bad Request
  • 401 Unauthorized
  • 402 Payment Required
  • 403 Forbidden
  • 405 Method Not Allowed
  • 406 Not Acceptable
  • 407 Proxy Authentication Required
  • 409 Conflict - it depends
  • 410 Gone
  • 411 Length Required
  • 412 Precondition Failed
  • 413 Payload Too Large 
  • 414 URI Too Long
  • 415 Unsupported Media Type
  • 416 Range Not Satisfiable
  • 417 Expectation Failed
  • 418 I'm a teapot - not sure about this one
  • 421 Misdirected Request
  • 422 Unprocessable Entity
  • 423 Locked - it depends on how long a resource is locked on average (?)
  • 424 Failed Dependency
  • 426 Upgrade Required - can the client be upgraded automatically?
  • 428 Precondition Required - I don't think the precondition can be fulfilled the second time without retrying from the beginning of the whole process, but it depends
  • 429 Too Many Requests - it depends, but it should not be retried too fast
  • 431 Request Header Fields Too Large
  • 451 Unavailable For Legal Reasons

So, most of the 4** Client errors should not be retried.

The 5** server errors that should not be retried:

  • 500 Internal Server Error - it depends on the cause of the error
  • 501 Not Implemented
  • 502 Bad Gateway - I have seen it used for temporary errors, so it depends
  • 505 HTTP Version Not Supported
  • 506 Variant Also Negotiates
  • 507 Insufficient Storage
  • 508 Loop Detected
  • 510 Not Extended
  • 511 Network Authentication Required
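
As a rough sketch, the two lists above could be encoded as a lookup. The names `NO_RETRY`, `DEPENDS`, `classify`, and `should_retry` are made up for this example. Codes that the lists do not mention (such as 404, 408, 503, and 504) are reported as "unlisted", and treating unlisted 5xx codes as retryable is an assumption of this sketch, not something the answer states.

```python
# Codes listed above as not worth retrying (418 and 426 are hedged in the list).
NO_RETRY = {
    400, 401, 402, 403, 405, 406, 407, 410, 411, 412, 413, 414, 415, 416,
    417, 418, 421, 422, 424, 426, 431, 451,
    501, 505, 506, 507, 508, 510, 511,
}

# Codes marked "it depends" above; the caller has to decide.
DEPENDS = {409, 423, 428, 429, 500, 502}


def classify(status):
    """Return "no_retry", "depends" or "unlisted" for an HTTP status code."""
    if status in NO_RETRY:
        return "no_retry"
    if status in DEPENDS:
        return "depends"
    return "unlisted"


def should_retry(status, retry_depends=False):
    """Decide whether to retry; retry_depends controls the "it depends" codes."""
    kind = classify(status)
    if kind == "no_retry":
        return False
    if kind == "depends":
        return retry_depends
    # Assumption: 5xx codes left off the list (e.g. 503, 504) are transient.
    return 500 <= status < 600
```

For example, `should_retry(500)` is `False` by default, while `should_retry(500, retry_depends=True)` is `True`, which mirrors the disagreement about 500 in the comments below.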

However, in order to make the microservices more resilient you should use the Circuit breaker pattern and fail fast when the upstream is down.
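
A minimal sketch of the circuit breaker idea, assuming a single-threaded caller and illustrative values for `failure_threshold` and `reset_timeout`. The class and exception names are hypothetical; a production service would more likely use an existing library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected without contacting the upstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast: do not even try the upstream.
                raise CircuitOpenError("upstream marked as down")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            half_open = self.opened_at is not None
            if half_open or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping each upstream request in `breaker.call(...)` means that, once the threshold is reached, callers get `CircuitOpenError` immediately instead of waiting for another timeout.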

Community
Constantin Galbenu
  • very interesting list. What would you say is the reason for not retrying a 500? – cahen Dec 16 '17 at 16:08
  • 4
    @cahen from what I've seen, programming errors are reported as 500. How fast do you think a BUG is patched? 300ms? :) It depends on the heuristics. RFC 2616 says that the condition (temporary vs permanent) should be included in the response: `the server SHOULD include an entity containing an explanation of the error situation, and whether it is a temporary or permanent condition`. – Constantin Galbenu Dec 16 '17 at 16:56
  • 36
    500 could be anything on the server e.g. a database communication glitch that would be gone on the next retry. I would retry 500 – Stig Feb 23 '19 at 21:30
  • 1
    @ConstantinGalbenu, I agree with programming errors being reported as 500s, but not all 500s are programming errors. Sometimes it's possible to recover as in the example above. – cahen Mar 21 '19 at 14:24
  • This probably depends on the service you are interfacing with. I currently use Rackspace's Cloud Files API for instance and I've been experiencing some really strange issues with them on random occasions beyond my control. An example is valid auth tokens would randomly stop working for a few seconds (returns a 401), which they acknowledge is a bug (which hasn't been fixed for years now). Another is uploaded files would sometimes show up as 0 bytes (haven't tested this much yet, but they allow etag validations that return a 422 on failure). These usually resolve themselves after a retry. – georaldc Jul 11 '19 at 21:21
  • @georaldc If they admit that this is a bug then that is out of scope of this question+answer (a bug means it doesn't work as intended). And yes, it always depends when humans are involved so basically all the time when 2 systems communicate (they are still made by humans). – Constantin Galbenu Jul 17 '19 at 06:16
  • 1
    Would recommend to retry for 502s. See those happening a lot with APIs and retries usually helps. – Erik Kalkoken Mar 31 '20 at 11:22
  • I think for database / downstream system issues, then 'Bad Gateway' is applicable. If you _know_ it's transient, then maybe a [503](https://www.rfc-editor.org/rfc/rfc2616#section-10.5.4) – Steve Dunn May 03 '22 at 11:11
  • @cahen why not accepting this post as the answer? – rufreakde Mar 29 '23 at 11:40
  • 1
    @rufreakde good question. I never marked this as the accepted answer before because I think that 500 should be retried. Unexpected server errors are probably the best use case for a retry actually. I can see that this specific line was updated 5 years later saying "it depends on the cause of the error", so this is good. I'd also retry 502 and 507. That being said, since 2021 I think it should be the accepted answer because I don't think there's a perfect list for all scenarios and this covers a lot of ground really well. Thanks for bringing this up – cahen Mar 29 '23 at 14:11
  • A terrible opinion. E.g. 401 it's just a normal return code if your token has expired and the client just needs to update the token and try again. – Webaib Apr 27 '23 at 12:04
10

4xx codes mean that an error has been made on the caller's side. That could be a bad URL, bad authentication credentials, or anything else that indicates a bad request. Without fixing that problem, there is no use in retrying. The error is in the caller's domain, and the caller should fix it instead of hoping that it will fix itself.

There are exceptions. Imagine the service is being redeployed or restarted. At that instant there is no endpoint registered, so a 4xx HTTP code is returned. A moment later, however, the server could be available again. A retry might therefore seem beneficial.

On closer analysis, a service should be restarted with a rolling restart to prevent an outage, so the previous argument no longer holds. However, if your environment does not follow this practice and you believe client-side errors (4xx codes) are worth retrying for the reason above, you may choose to do so. Mature systems won't, because there is no perceived benefit and they would lose the ability to fail fast.

5xx error codes should be retried, as those are service errors. They could be short term (overflowing threads, a dependent service refusing connections) or long term (a system defect, a dependent system outage, unavailable infrastructure). Sometimes services reply with information (often in headers) about whether the condition is permanent or temporary, and sometimes with a time parameter indicating when to retry. Based on these parameters, callers can choose whether to retry.
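
One standard header that carries such a time parameter is `Retry-After`, which holds either a number of seconds or an HTTP date. The paragraph above does not name a specific header, so using `Retry-After` here is an assumption, and `retry_after_seconds` is a hypothetical helper. A minimal sketch of parsing it:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Parse a Retry-After header value into a delay in seconds.

    Returns None when the header is absent or cannot be parsed.
    """
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        # Form 1: delay in seconds, e.g. "120".
        return float(value)
    try:
        # Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a date without zone information as UTC.
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now".
    return max(0.0, (when - now).total_seconds())
```

A caller could pass in the header value from a 429 or 503 response and fall back to its own backoff schedule when the result is `None`.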

1xx, 2xx and 3xx codes need not be retried for obvious reasons.

ArinBhat
    These generalizations don't seem to apply in all cases. I can see how a 408 (timeout), 409 (conflict), or 429 (too many requests) could be retryable, whereas the other 400-level errors don't seem retryable. I can also see how 100 is expecting another request, as are most 300-level. Perhaps those are not retries, but they're also not errors... And most 500-level should not be retried. In the 5xx range, I think I would only retry 500 (Internal Server Error), 502 (Bad Gateway), 503 (Service Unavailable), 504 (Gateway Timeout), and 599 (Network Connect Timeout Error). – Sam Mefford Apr 28 '20 at 15:46
  • Who told you that the retry must be 100% with the same request? – Webaib Apr 27 '23 at 12:35