The setup:
We have a site at https://Main.externaldomain/xmlservlet that authenticates, validates, geo-locates, and proxies (slightly modified) requests to, for example, http://London04.internaldomain/xmlservlet.
End users have no direct access to internaldomain at all. Communication between the sites is occasionally interrupted, and sometimes the internaldomain nodes become unavailable or die.
The Main site is using org.apache.http.impl.client.DefaultHttpClient (I know it's deprecated, we're gradually upgrading this legacy code) with readTimeout set to 10,000 milliseconds.
The request and response have an XML payload/body of variable length; Transfer-Encoding: chunked is used, and Keep-Alive: timeout=15 is used as well.
The problem:
Sometimes London04 actually needs more than 10 seconds (say, 2 minutes) to execute. Sometimes it crashes non-gracefully. Sometimes other (networking) issues happen. During those 2 minutes, the portions of the response XML sometimes arrive steadily enough that no gap between them reaches 10 seconds, so the readTimeout is never exceeded; at other times there is a gap of 10+ seconds and HttpClient times out...
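To illustrate the behaviour described above (the read timeout applies to each gap between reads, not to the whole response), here is a minimal sketch using plain java.net sockets instead of HttpClient. The class and method names are made up for illustration, and the timings are scaled down from seconds to milliseconds:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class: a slow local "backend" trickles bytes to a client
// that reads with SO_TIMEOUT set.
final class ReadTimeoutGapSketch {

    /**
     * The backend sends `count` single bytes, sleeping `gapMillis` before each one.
     * Returns true if the client read everything up to end of stream,
     * false if one of its reads hit the SO_TIMEOUT.
     */
    static boolean readsWholeBody(int count, long gapMillis, int soTimeoutMillis) {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Thread slowBackend = new Thread(() -> {
                try (Socket s = server.accept(); OutputStream out = s.getOutputStream()) {
                    for (int i = 0; i < count; i++) {
                        Thread.sleep(gapMillis);
                        out.write('x');
                        out.flush();
                    }
                } catch (IOException | InterruptedException ignored) {
                    // The client gave up or the server socket was closed; nothing to do.
                }
            });
            slowBackend.setDaemon(true);
            slowBackend.start();

            try (Socket client = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort())) {
                client.setSoTimeout(soTimeoutMillis); // limit for each blocking read
                InputStream in = client.getInputStream();
                while (in.read() != -1) {
                    // keep draining until the backend closes the stream
                }
                return true;
            } catch (SocketTimeoutException e) {
                return false;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // About 500 ms in total, but every gap is shorter than the 400 ms timeout.
        System.out.println(readsWholeBody(5, 100, 400));
        // A single 1000 ms gap exceeds the 300 ms timeout.
        System.out.println(readsWholeBody(2, 1000, 300));
    }
}
```

In the first call the total transfer takes longer than the timeout and still completes; in the second a single long gap is enough to abort the read.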
We could try to increase the timeout on the Main side, but that would easily bloat/overload the listener pool (just with regular traffic, not even being DDoSed yet). We need a way to distinguish the case where the internal site is still working on generating the response from the cases where it has really crashed, the network is lost, etc. The best option seems to be some kind of heartbeat (every 5 seconds) during the communication.
We thought Keep-Alive would save us, but it seems to cover only the gaps between requests (not the time during a request), and it does not seem to do any heartbeating during the gap (it just waits for the timeout).
We thought chunked encoding might save us by sending some heartbeat (zero-byte chunks) to keep the other side informed, but there seems to be no default implementation supporting a heartbeat this way, and moreover a zero-byte chunk seems to be the end-of-data indicator itself...
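The zero-byte-chunk observation can be checked with a small sketch of the chunk framing (hex size, CRLF, data, CRLF). This is a hand-rolled illustration, not what HttpClient or the servlet container does internally, and the class and method names are hypothetical:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that frames payloads the way chunked transfer coding does.
final class ChunkedFramingSketch {

    // What ends a chunked body when there are no trailers: last-chunk "0" CRLF, then CRLF.
    static final String LAST_CHUNK_NO_TRAILERS = "0\r\n\r\n";

    /** Frames one chunk as: size in hex, CRLF, the data itself, CRLF. */
    static byte[] frameChunk(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] size = Integer.toHexString(data.length).getBytes(StandardCharsets.US_ASCII);
        byte[] crlf = {'\r', '\n'};
        out.write(size, 0, size.length);
        out.write(crlf, 0, crlf.length);
        out.write(data, 0, data.length);
        out.write(crlf, 0, crlf.length);
        return out.toByteArray();
    }

    /** Convenience wrapper for ASCII payloads, so the framing can be compared as text. */
    static String frameAscii(String payload) {
        byte[] framed = frameChunk(payload.getBytes(StandardCharsets.US_ASCII));
        return new String(framed, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // A normal chunk carrying four bytes of XML.
        System.out.println(frameAscii("<a/>").equals("4\r\n<a/>\r\n"));    // true
        // An "empty heartbeat chunk" is byte-for-byte the end-of-body marker.
        System.out.println(frameAscii("").equals(LAST_CHUNK_NO_TRAILERS)); // true
    }
}
```

So a naively framed empty chunk cannot act as a heartbeat: the receiver would treat it as the end of the response body.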
Question(s):
If we're correct in assuming that Keep-Alive/chunked encoding won't help us achieve a heartbeat or fast detection of a dead backend, then:
1) At which layer should such a heartbeat be implemented? HTTP? TCP?
2) Is there any standard framework/library/setting/etc. that implements it already? (If possible: Java, REST.)
UPDATE
I've also looked for heartbeat implementations for WADL/WSDL, but found none for REST, and checked out WebSockets... I also looked into TCP keepalives, which seem to be the right feature for the task:
- https://en.wikipedia.org/wiki/Keepalive
- http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html
- Socket heartbeat vs keepalive
- WebSockets ping/pong, why not TCP keepalive?
BUT according to those I'd have to set up something like:
- tcp_keepalive_time=5
- tcp_keepalive_intvl=1
- tcp_keepalive_probes=3
which seems to go against the recommendations (2h is the recommended value, and 10min is already presented as an odd value; is going down to 5s sane/safe? If it is, it might be my solution...)
Also, where should I configure this? On London04 alone, or on Main too? (If I set it up on Main, won't it flood the client-->Main frontend communication? Or might the NATs/etc. between the sites easily ruin the keepalive intent/support?)
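For reference, a rough sketch of what the three values listed above would look like if set per socket from Java rather than system-wide, so that only the Main-to-London04 connections are affected. It assumes Java 11+ and a platform where the jdk.net.ExtendedSocketOptions keepalive options are supported (the code checks for that); the class and method names are hypothetical, and wiring such a socket into DefaultHttpClient is not shown:

```java
import java.io.IOException;
import java.net.Socket;
import java.net.StandardSocketOptions;
import jdk.net.ExtendedSocketOptions;

// Hypothetical helper; not part of HttpClient or any framework.
final class TcpKeepAliveSketch {

    /**
     * Rough upper bound on how long a dead peer goes unnoticed on an idle connection:
     * the idle time before the first probe plus one interval per unanswered probe.
     */
    static int worstCaseDetectionSeconds(int idleSeconds, int intervalSeconds, int probes) {
        return idleSeconds + intervalSeconds * probes;
    }

    /**
     * Enables SO_KEEPALIVE and, where the platform supports it, the per-socket
     * counterparts of tcp_keepalive_time / tcp_keepalive_intvl / tcp_keepalive_probes.
     * Returns false if only plain SO_KEEPALIVE (with system defaults) could be set.
     */
    static boolean enableFastKeepAlive(Socket socket, int idleSeconds, int intervalSeconds, int probes)
            throws IOException {
        socket.setOption(StandardSocketOptions.SO_KEEPALIVE, true);
        if (!socket.supportedOptions().contains(ExtendedSocketOptions.TCP_KEEPIDLE)) {
            return false;
        }
        socket.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, idleSeconds);
        socket.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, intervalSeconds);
        socket.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, probes);
        return true;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(worstCaseDetectionSeconds(5, 1, 3)); // 8
        try (Socket socket = new Socket()) { // unconnected socket, only used to try the options
            System.out.println("per-socket tuning applied: " + enableFastKeepAlive(socket, 5, 1, 3));
        }
    }
}
```

With 5/1/3 a dead peer on an idle connection would be noticed after roughly 8 seconds. As far as I understand, the probes are answered by the peer's TCP stack and carry no application data, so they would reveal a dead host or broken path but would not reset the 10-second readTimeout while a live London04 is merely slow.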
P.S. any link to an RTFM is welcome - I might just be missing something obvious :)