
The setup:

We have an https://Main.externaldomain/xmlservlet site which authenticates, validates and geo-locates requests and proxies them (slightly modified) to, for example, http://London04.internaldomain/xmlservlet.

There's no direct access to internaldomain exposed to end-users at all. The communication between the sites is occasionally interrupted, and sometimes the internaldomain nodes become unavailable/dead.

The Main site is using org.apache.http.impl.client.DefaultHttpClient (I know it's deprecated, we're gradually upgrading this legacy code) with the read timeout set to 10,000 milliseconds. The requests and responses have XML payloads/bodies of variable length; Transfer-Encoding: chunked is used, and so is Keep-Alive: timeout=15.
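(For context, this is roughly how that 10-second read timeout is wired up; the builder-style variant is the upgrade target rather than what we run today, and the 5-second connect timeout is just an illustrative value:)

    import org.apache.http.client.config.RequestConfig;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.DefaultHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.params.HttpConnectionParams;

    public class TimeoutConfigExample {

        // Legacy style: the deprecated DefaultHttpClient we currently run.
        static DefaultHttpClient legacyClient() {
            DefaultHttpClient client = new DefaultHttpClient();
            // "readTimeout": max gap between data packets, 10 seconds.
            HttpConnectionParams.setSoTimeout(client.getParams(), 10_000);
            return client;
        }

        // HttpClient 4.3+ style: the upgrade target.
        static CloseableHttpClient modernClient() {
            RequestConfig config = RequestConfig.custom()
                    .setConnectTimeout(5_000)   // TCP connect (illustrative value)
                    .setSocketTimeout(10_000)   // same 10-second read timeout
                    .build();
            return HttpClients.custom().setDefaultRequestConfig(config).build();
        }
    }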

The problem:

Sometimes London04 actually needs more than 10 seconds (say, 2 minutes) to execute. Sometimes it crashes non-gracefully. Sometimes other (networking) issues happen. Sometimes, during those 2 minutes, the response XML trickles in steadily enough that there is never a 10-second gap between portions and the readTimeout is never exceeded; sometimes there is a 10+ second gap and HttpClient times out...

We could try to increase the timeout on the Main side, but that would quickly bloat/overload the listener pool (with just regular traffic, let alone a DDoS). We need a way to distinguish "the internal site is still working on the response" from "it really crashed / the network was lost / etc.", and some kind of heartbeat (every 5 seconds) during the communication feels like the best fit.

We thought Keep-Alive would save us, but it only seems to cover the gaps between requests (not during a request), and it doesn't do any heartbeating during that gap anyway (it just sits there waiting for the timeout).

We thought chunked encoding might save us by sending some heartbeat (zero-byte chunks) to let the other side know we're alive, but there seems to be no standard/default implementation of a heartbeat done this way, and moreover a zero-byte chunk is itself the end-of-data indicator...

Question(s):

If we're correct in assuming that Keep-Alive/chunked encoding won't give us a heartbeat / fast detection of a dead backend, then:

1) at which layer should such a heartbeat be implemented? HTTP? TCP?

2) is there any standard framework/library/setting/etc. that already implements it? (ideally: Java, REST)


UPDATE

I've also looked into heartbeat implementations for WADL/WSDL, though found none for REST, and checked out WebSockets... Also looked into TCP keepalives, which seem to be the right feature for the task:

BUT according to those I'd have to set up something like:

  • tcp_keepalive_time=5
  • tcp_keepalive_intvl=1
  • tcp_keepalive_probes=3

which seems to run counter to the usual recommendations (2 hours is the recommended default, 10 minutes is already presented as an unusual value; is going down to 5 seconds sane/safe?? If it is, this might be my solution right away...)

Also, where should I configure this? On London04 alone, or on Main too? (If I set it up on Main, won't it flood the client-->Main frontend communication? Or might the NATs/etc. between the sites easily break the keepalive intent/support?)
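(For reference, a sketch of what a per-socket alternative to those system-wide sysctls might look like from Java 11+ on Linux/macOS; the class name and the 5s/1s/3 values are just illustrative, mirroring the sysctls above:)

    import java.io.IOException;
    import java.net.Socket;
    import java.net.StandardSocketOptions;
    import jdk.net.ExtendedSocketOptions;

    public class KeepAliveSocketConfig {

        // Enables TCP keepalive on a single outgoing socket instead of
        // changing the host-wide tcp_keepalive_* sysctls.
        public static Socket configure(Socket socket) throws IOException {
            socket.setOption(StandardSocketOptions.SO_KEEPALIVE, true);   // turn keepalive on
            socket.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, 5);      // idle seconds before first probe
            socket.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, 1);  // seconds between probes
            socket.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, 3);     // failed probes before the peer is declared dead
            return socket;
        }
    }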

P.S. Any link to an RTFM is welcome; I might just be missing something obvious :)

Vlad

2 Answers


My advice would be: don't use a heartbeat. Have your external-facing API return a 303 See Other with headers that indicate when and where the desired response might be available.

So you might call:

POST https://public.api/my/call

and get back

303 See Other
Location: https://public.api/my/call/results
Retry-After: 10

To the extent your server can guess how long a response will take to build, it should factor that into the Retry-After value. If a later GET call is made to the new location and the results are not yet done being built, return a response with an updated Retry-After value. So maybe you try 10, and if that doesn't work, you tell the client to wait another 110, which would be two minutes in total.
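For illustration, a minimal JAX-RS sketch of that pattern might look like the following; the paths, the ResultStore helper, the 503-while-building choice and the Retry-After values are placeholders rather than anything prescribed above:

    import java.net.URI;
    import javax.ws.rs.GET;
    import javax.ws.rs.POST;
    import javax.ws.rs.Path;
    import javax.ws.rs.PathParam;
    import javax.ws.rs.core.Response;

    @Path("/my/call")
    public class AsyncCallResource {

        @POST
        public Response submit(String requestXml) {
            // ResultStore is a hypothetical helper that starts the backend work
            // and hands back an id the client can poll.
            String id = ResultStore.startBackendJob(requestXml);
            return Response.status(Response.Status.SEE_OTHER)        // 303
                    .location(URI.create("/my/call/results/" + id))
                    .header("Retry-After", 10)                       // first guess: 10 seconds
                    .build();
        }

        @GET
        @Path("/results/{id}")
        public Response poll(@PathParam("id") String id) {
            if (!ResultStore.isDone(id)) {
                // Still being built: send the client away with an updated estimate.
                return Response.status(Response.Status.SERVICE_UNAVAILABLE)
                        .header("Retry-After", 110)                  // updated estimate
                        .build();
            }
            return Response.ok(ResultStore.fetch(id)).build();
        }
    }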

Alternatively, use a protocol that's designed to stay open for long periods of time, such as WebSockets.
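If you go the WebSocket route, a minimal Spring WebSocket server sketch might look like this; the endpoint path, the handler name and the BackendClient call are placeholders, not something the question defines:

    import org.springframework.context.annotation.Configuration;
    import org.springframework.web.socket.TextMessage;
    import org.springframework.web.socket.WebSocketSession;
    import org.springframework.web.socket.config.annotation.EnableWebSocket;
    import org.springframework.web.socket.config.annotation.WebSocketConfigurer;
    import org.springframework.web.socket.config.annotation.WebSocketHandlerRegistry;
    import org.springframework.web.socket.handler.TextWebSocketHandler;

    @Configuration
    @EnableWebSocket
    public class XmlCallWebSocketConfig implements WebSocketConfigurer {

        @Override
        public void registerWebSocketHandlers(WebSocketHandlerRegistry registry) {
            registry.addHandler(new XmlCallHandler(), "/ws/xmlservlet");
        }

        static class XmlCallHandler extends TextWebSocketHandler {
            @Override
            protected void handleTextMessage(WebSocketSession session, TextMessage message) throws Exception {
                // The socket stays open for as long as the backend needs;
                // WebSocket ping/pong frames can serve as the heartbeat.
                String resultXml = BackendClient.callLondon04(message.getPayload()); // hypothetical backend call
                session.sendMessage(new TextMessage(resultXml));
            }
        }
    }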

Eric Stein
  • Polling is not a good solution (especially with RAM overheads for response-buffering and not-really-predictable Retry-After). We looked into WebSockets (see the question), but after You've re-raised it as an option - found some good default (spring) implementation of client&server parts which might be quite compatible with our code (only the transport is changing) - will give it a shot, thank You! – Vlad Feb 28 '19 at 04:55
  • Accepting the WebSocket portion of the answer, thank You! – Vlad Feb 28 '19 at 16:27

Take a look at SSE (Server-Sent Events).

example code: https://github.com/rsvoboda/resteasy-sse

or vertx event-bus: https://vertx.io/docs/apidocs/io/vertx/core/eventbus/EventBus.html
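A minimal JAX-RS 2.1 (javax.ws.rs.sse) sketch of the SSE idea, with the server pushing periodic heartbeat events while the real result is being built; the resource path, the 5-second interval and the BackendClient call are illustrative assumptions:

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.ScheduledFuture;
    import java.util.concurrent.TimeUnit;
    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import javax.ws.rs.Produces;
    import javax.ws.rs.core.Context;
    import javax.ws.rs.core.MediaType;
    import javax.ws.rs.sse.Sse;
    import javax.ws.rs.sse.SseEventSink;

    @Path("/my/call")
    public class SseCallResource {

        private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        @GET
        @Produces(MediaType.SERVER_SENT_EVENTS)
        public void call(@Context SseEventSink sink, @Context Sse sse) {
            // Server-side-driven heartbeat every 5 seconds, so the client can
            // tell "still working" from "dead backend".
            ScheduledFuture<?> heartbeat = scheduler.scheduleAtFixedRate(
                    () -> sink.send(sse.newEvent("heartbeat", "still-working")),
                    5, 5, TimeUnit.SECONDS);

            // Hypothetical long-running backend call (e.g. the proxied London04 request).
            CompletableFuture.supplyAsync(BackendClient::callLondon04)
                    .whenComplete((resultXml, error) -> {
                        heartbeat.cancel(false);
                        if (error == null) {
                            sink.send(sse.newEvent("result", resultXml));
                        }
                        sink.close();
                    });
        }
    }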

HRgiger
  • From https://stackoverflow.com/help/how-to-answer: Provide context for links Links to external resources are encouraged, but please add context around the link so your fellow users will have some idea what it is and why it’s there. Always quote the most relevant part of an important link, in case the target site is unreachable or goes permanently offline. – Eric Stein Feb 26 '19 at 16:21
  • javax.ws.rs.sse looks like a very viable option (except that the heartbeat has to be server-side-driven instead of having the client respond with I'm-still-here heartbeats) - thanks, will up-vote. Though I'm not seeing EventBus being relevant. – Vlad Feb 28 '19 at 05:15
  • @Vlad take a look https://sandny.com/2017/10/21/how-to-create-a-vertx-io-eventbus-browser-js-client-and-use-it-with-a-web-server-with-cors/ – HRgiger Feb 28 '19 at 09:15
  • Now I follow what You meant for the EventBus (it'd be interesting to see a tcpdump of their over-HTTP communication protocol), but sadly all the features of vertx require it to start a standalone service/listener (separate from the existing webapp container). I've found a working option to proxy/bridge it through the Tomcat container, but that option seems to impact vertx features via pre-buffering/etc. overheads on top of vertx communication: https://stackoverflow.com/questions/36432903/deploy-vert-x-on-tomcat I'll stick to the WebSockets answer due to the ease of switching the existing code to it, thx! – Vlad Feb 28 '19 at 13:47