Spring-cloud Zuul retry when instance is down

Question

Using Spring-cloud Angel.SR6:

Here is the configuration of my Spring-boot app with @EnableZuulProxy:

server.port=8765

ribbon.ConnectTimeout=500
ribbon.ReadTimeout=5000
ribbon.MaxAutoRetries=1
ribbon.MaxAutoRetriesNextServer=1
ribbon.OkToRetryOnAllOperations=true

zuul.routes.service-id.retryable=true

I have 2 instances of service-id running on random ports. These instances, as well as the Zuul instance, successfully register with Eureka, and I can access RESTful endpoints on the 2 service-id instances by accessing http://localhost:8765/service-id/.... and find that they are balanced in a round-robin manner.

I would like to kill one of the service-id instances and, when that defunct instance is next in line for forwarding, have Zuul attempt to contact it, fail, and retry with the other instance.

Is this possible, or am I misreading the documentation? When I try the above config, the request 'destined' for the defunct instance fails with a 500 Forwarding error. From the Zuul stacktrace:

com.netflix.zuul.exception.ZuulException: Forwarding error
    at org.springframework.cloud.netflix.zuul.filters.route.RibbonRoutingFilter.forward(RibbonRoutingFilter.java:140)

....

Caused by: com.netflix.hystrix.exception.HystrixRuntimeException: service-idRibbonCommand timed-out and no fallback available

The subsequent request succeeds as expected. This behavior continues until the defunct instance is removed from Zuul's registry.

EDIT: Updated to Brixton.M5. No change in behavior. Here's the Hystrix exception in more detail:

Caused by: com.netflix.hystrix.exception.HystrixRuntimeException: service-id timed-out and no fallback available.
    at com.netflix.hystrix.AbstractCommand$16.call(AbstractCommand.java:806) ~[hystrix-core-1.4.23.jar:1.4.23]
    at com.netflix.hystrix.AbstractCommand$16.call(AbstractCommand.java:790) ~[hystrix-core-1.4.23.jar:1.4.23]
    at rx.internal.operators.OperatorOnErrorResumeNextViaFunction$1.onError(OperatorOnErrorResumeNextViaFunction.java:99) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.internal.operators.OperatorDoOnEach$1.onError(OperatorDoOnEach.java:70) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.internal.operators.OperatorDoOnEach$1.onError(OperatorDoOnEach.java:70) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.internal.operators.OperatorDoOnEach$1.onError(OperatorDoOnEach.java:70) ~[rxjava-1.0.14.jar:1.0.14]
    at com.netflix.hystrix.AbstractCommand$DeprecatedOnFallbackHookApplication$1.onError(AbstractCommand.java:1521) ~[hystrix-core-1.4.23.jar:1.4.23]
    at com.netflix.hystrix.AbstractCommand$FallbackHookApplication$1.onError(AbstractCommand.java:1411) ~[hystrix-core-1.4.23.jar:1.4.23]
    at com.netflix.hystrix.HystrixCommand$2.call(HystrixCommand.java:314) ~[hystrix-core-1.4.23.jar:1.4.23]
    at com.netflix.hystrix.HystrixCommand$2.call(HystrixCommand.java:306) ~[hystrix-core-1.4.23.jar:1.4.23]
    at rx.Observable$2.call(Observable.java:162) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:154) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:162) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:154) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:162) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:154) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:162) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:154) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:162) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:154) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:162) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:154) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:162) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:154) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:162) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:154) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:162) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:154) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:162) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:154) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable.unsafeSubscribe(Observable.java:7710) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.internal.operators.OperatorOnErrorResumeNextViaFunction$1.onError(OperatorOnErrorResumeNextViaFunction.java:100) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.internal.operators.OperatorDoOnEach$1.onError(OperatorDoOnEach.java:70) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.internal.operators.OperatorDoOnEach$1.onError(OperatorDoOnEach.java:70) ~[rxjava-1.0.14.jar:1.0.14]
    at com.netflix.hystrix.AbstractCommand$HystrixObservableTimeoutOperator$1.run(AbstractCommand.java:958) ~[hystrix-core-1.4.23.jar:1.4.23]
    at com.netflix.hystrix.strategy.concurrency.HystrixContextRunnable$1.call(HystrixContextRunnable.java:41) ~[hystrix-core-1.4.23.jar:1.4.23]
    at com.netflix.hystrix.strategy.concurrency.HystrixContextRunnable$1.call(HystrixContextRunnable.java:37) ~[hystrix-core-1.4.23.jar:1.4.23]
    at com.netflix.hystrix.strategy.concurrency.HystrixContextRunnable.run(HystrixContextRunnable.java:57) ~[hystrix-core-1.4.23.jar:1.4.23]
    at com.netflix.hystrix.AbstractCommand$HystrixObservableTimeoutOperator$2.tick(AbstractCommand.java:978) ~[hystrix-core-1.4.23.jar:1.4.23]
    at com.netflix.hystrix.util.HystrixTimer$1.run(HystrixTimer.java:100) ~[hystrix-core-1.4.23.jar:1.4.23]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_66]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) ~[na:1.8.0_66]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) ~[na:1.8.0_66]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) ~[na:1.8.0_66]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_66]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_66]
    ... 1 common frames omitted

Caused by: java.util.concurrent.TimeoutException: null
    at com.netflix.hystrix.AbstractCommand$9.call(AbstractCommand.java:601) ~[hystrix-core-1.4.23.jar:1.4.23]
    at com.netflix.hystrix.AbstractCommand$9.call(AbstractCommand.java:581) ~[hystrix-core-1.4.23.jar:1.4.23]
    at rx.internal.operators.OperatorOnErrorResumeNextViaFunction$1.onError(OperatorOnErrorResumeNextViaFunction.java:99) ~[rxjava-1.0.14.jar:1.0.14]
    ... 15 common frames omitted

It works for me with Brixton snapshots. I didn't set all those "ribbon.*" properties, so maybe that is actually hurting somehow, or maybe it's something that only works with the newer stack? Can you try Brixton? — Dave Syer, Mar 03 '16 at 13:57
Can you show the rest of the `zuul.routes` configuration. Are you doing a GET operation? — spencergibb, Mar 03 '16 at 19:17
@DaveSyer I switched to Brixton.M5 and removed the `ribbon.*` properties but it hasn't made any difference. I did notice that, instead of setting `eureka.instance.metadata.instanceId`, I had to set `eureka.instance.instanceId` to differentiate b/w my instances (b/c they're running on `server.port=0`). — grinder, Mar 03 '16 at 19:26
@spencergibb: The only other zuul config I have is `zuul.ignoredServices`, which is set to remove Zuul and Eureka itself from routing.... And, yes, it is a GET (though, ultimately I'd like retry on POST, PUT and DELETE) — grinder, Mar 03 '16 at 19:28
Maybe max retries "1" is too few then. You have to let it try more than once? — Dave Syer, Mar 03 '16 at 21:43
@DaveSyer I tried these settings: `ribbon.MaxAutoRetries=2` and `ribbon.MaxAutoRetriesNextServer=2`, but it didn't seem to make any difference. Will continue to poke around and report if I see anything interesting (or make any progress with this). — grinder, Mar 08 '16 at 13:45
@grinder, did you solve your problem? did you manage to get it to retry on a different server? — phoenix7360, May 17 '16 at 08:23
we just recently updated our Maven dependencies to Brixton; now GETs are retried successfully. And this happens with no `ribbon.*` properties set. — grinder, May 24 '16 at 14:09
oops - also set `hystrix.command.default.execution.timeout.enabled=false` YMMV — grinder, May 24 '16 at 14:28

Andre · Answer 1 · 2017-06-06T09:54:28.123

I had the same problem. This solved it for me:

Regarding to this article ribbon is only retring when the http client is set to ribbon's restclient. On default ribbon is using the Apache http client which does not retry any request.

Due the fact that ribbon's restclient is deprecated you should consider using spring-retry (https://github.com/spring-projects/spring-retry)

Keep in mind that you also have to handle the hystrix timeouts for zuul as well when you configure your retries on ribbon.

score 0 · Answer 2 · answered Jun 05 '17 at 07:02

Ribbon uses the registered serviced in eureka, so it is up to eureka to update service status and let caller knows the available servers.

In my understanding, when one server is down, there are 2 ways to know:
1. wait for eureka server to update service status. But this update will take some time, 30 seconds as default.
2. try to call and mark it as down, (maybe will confirm with eureka server later)

So, in you question, you said after the first request failed, subsequent request succeeds. I think it is right behavior.

Spring-cloud Zuul retry when instance is down

2 Answers2

Linked