3

The Problem

An app I'm maintaining keeps getting socket timeouts after approximately 21000 ms, despite the fact that I've explicitly set longer timeouts. This seemingly magical value of 21000 ms has come up in a few other SO questions and answers, and I'm trying to figure out exactly where it comes from.

Here's the essence of my code:

HttpURLConnection connection = null;
try {
    URL url = new URL(urlString);
    connection = (HttpURLConnection) url.openConnection();
    connection.setConnectTimeout(45000);
    connection.setReadTimeout(90000);
    int responseCode = connection.getResponseCode();
    if (responseCode == 200) {
        // code omitted
    }
} catch (Exception e) {
    // code omitted
}
finally {
    if (connection != null) {
        connection.disconnect();
    }
}

Catching all exceptions in one block is admittedly not ideal, but it's inherited code and I'm reluctant to mess with it. I know it's catching SocketTimeoutException after 21000 ms because it logs the simple name of the exception class.

Clues

I found a question where an asker was getting a ConnectTimeout after 21000 ms, despite explicitly setting it to 40000 ms. That's intriguing, even though the exception class is different from mine.

I also found a poorly-explained answer which claims that the server side is responsible for the 21000 ms timeout.

My Hunch

I don't think any action or inaction of the server could cause a shorter-than-expected socket timeout on the client. But maybe the TCP stacks in Windows and Android share a common ancestor, or at least use similar connect retry logic.

Could it be that Android imposes a maximum connect timeout of 21000 ms, and setting a longer timeout in HttpURLConnection is futile? Or could this timeout be triggered by some Windows machine on the path between the mobile device and the server? Do some Android versions throw a SocketTimeoutException where others throw a ConnectException?

Kevin Krumwiede
  • The 'poorly explained answer' is certainly wrong. It doesn't make any sense to claim that the behaviour of the server-side platform when performing outbound connects can affect the behaviour of a different platform when making outbound connects to the server platform. – user207421 Nov 12 '14 at 23:06
  • But what about the outbound connect from a router somewhere on the path between the client and server? My understanding of how network errors and failures propagate back to the client is a little fuzzy. – Kevin Krumwiede Nov 13 '14 at 17:49

3 Answers

7

According to RFC 1122 (TRANSPORT LAYER -- TCP), section 4.2.3.1 ("Retransmission Timeout Calculation"):

"Implementation also MUST include exponential backoff for successive RTO values for the same segment".

So xpa1492's answer sounds plausible (despite its Windows-specific nature); the implementation of a TCP stack either follows this RFC or gets panned for failing to do so.

By the way, RFC 1122 explicitly specifies 3 seconds as the initial timeout, which makes xpa1492's (3 + 6 + 12 = 21) answer sound like the answer to your mystery.

And yes, the Android TCP stack shares a common ancestor with the Windows TCP stack; they were both created using RFC 1122 as a guide ("[The Linux TCP stack is] an implementation of the TCP protocol defined in RFC 793, RFC 1122 and RFC 2001 with the NewReno and SACK extensions").

I suspect that your problem is related to radio interference, so you might want to try enabling F-RTO, as you might be hitting the "magic number" repeatedly because of the environment in which you are testing.

Daniel Randall
  • Are you suggesting that the Android device itself imposes a maximum connect timeout of 21s? Or that some router between the device and the server it's connecting to might be imposing that limit? – Kevin Krumwiede Nov 13 '14 at 22:58
  • Kevin, Yes, it is being imposed by the abstraction layer and/or the TCP stack on the local device; why they did not document it, I do not know. – Daniel Randall Nov 14 '14 at 00:44
  • Unfortunately, although it appears that this answer *should* be correct, it empirically isn't. I'm leaning toward the inability to override the default timeout being device-specific behavior. Now taking bets on whether the miscreant device turns out to be another [expletive deleted] Samsung... – Kevin Krumwiede Nov 16 '14 at 04:02
  • @KevinKrumwiede, it sounds like the device-specific possibility is the most plausible; these device manufacturers simply don't care anymore, and probably consider consistent API support to be unimportant, unless, of course, a certain carrier wants it for one of their proprietary apps. – Daniel Randall Nov 18 '14 at 01:29
  • @KevinKrumwiede, you might discover that the problem is a combination of device-specific and carrier-specific, like Galaxy S4 on Verizon. – Daniel Randall Nov 18 '14 at 01:32
4

It seems like it is a Windows default configuration...

https://social.technet.microsoft.com/Forums/windows/en-US/9e7f59dd-6469-4ade-91ca-ceb5bcaf2675/windows-7-tcp-parameter-tcpmaxconnectretransmissions-and-tcpinitialrtt?forum=w7itpronetworking

Based on the link and some further reading, Windows will by default do 3 retries and double the timeout with each attempt, starting at 3 sec. So you end up with a 3 sec + 6 sec + 12 sec = 21 sec timeout.
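
The arithmetic behind that figure can be sketched in a few lines of Java. This is only an illustration of the doubling described above, not code taken from any TCP stack. The class and method names (SynBackoff, totalConnectTimeoutMillis) are made up for the example, and the 3000 ms starting value and the count of 3 are simply the numbers quoted in this answer.

```java
// Hypothetical helper: illustrates the doubling arithmetic only.
class SynBackoff {

    /**
     * Sums the waits for a connect attempt whose timeout starts at
     * initialTimeoutMillis and doubles after every wait.
     */
    static long totalConnectTimeoutMillis(long initialTimeoutMillis, int waits) {
        long total = 0;
        long current = initialTimeoutMillis;
        for (int i = 0; i < waits; i++) {
            total += current;   // wait out the current timeout
            current *= 2;       // exponential backoff: double for the next wait
        }
        return total;
    }

    public static void main(String[] args) {
        // 3000 + 6000 + 12000
        System.out.println(totalConnectTimeoutMillis(3000, 3)); // prints 21000
    }
}
```

If the doubling is what produces the limit, then the total depends only on the stack's own starting value and retry count, which would explain why a longer application-level timeout changes nothing.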

xpa1492
1

I wrote a crude test app, based on the code in my question, that simulates a connect timeout by attempting to connect to a non-routable address as suggested in this answer. On my Moto G (Android 4.4.2), it throws a SocketTimeoutException in approximately 45 seconds as expected. Curiously, if I do not explicitly set the connect timeout, it instead throws a ConnectException after approximately one minute.
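
For anyone who wants to repeat the experiment, a probe along these lines might look like the sketch below. It is not the actual test app. The names ConnectTimeoutProbe, Result and probe are invented for illustration, and passing 0 to mean "leave the connect timeout at its default" is a convention of this sketch only. It uses the same HttpURLConnection calls as the code in the question and records which exception class was thrown and how long it took.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical probe: times a request and reports the exception class, if any.
class ConnectTimeoutProbe {

    static final class Result {
        final String exceptionName;  // simple class name, or "none" on success
        final long elapsedMillis;    // wall-clock time until success or failure

        Result(String exceptionName, long elapsedMillis) {
            this.exceptionName = exceptionName;
            this.elapsedMillis = elapsedMillis;
        }
    }

    static Result probe(String urlString, int connectTimeoutMillis, int readTimeoutMillis) {
        long start = System.nanoTime();
        String exceptionName = "none";
        HttpURLConnection connection = null;
        try {
            connection = (HttpURLConnection) new URL(urlString).openConnection();
            if (connectTimeoutMillis > 0) {
                // Sketch convention: 0 means "do not set it explicitly".
                connection.setConnectTimeout(connectTimeoutMillis);
            }
            connection.setReadTimeout(readTimeoutMillis);
            connection.getResponseCode();  // triggers the actual connect
        } catch (IOException e) {
            // SocketTimeoutException and ConnectException are both IOExceptions.
            exceptionName = e.getClass().getSimpleName();
        } finally {
            if (connection != null) {
                connection.disconnect();
            }
        }
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new Result(exceptionName, elapsedMillis);
    }
}
```

Calling probe with a URL on a non-routable address, 45000 and 90000 mirrors the timeouts from the question; which address is actually non-routable depends on your network, so treat any particular one as an assumption. On Android the call has to run off the main thread, otherwise it fails with NetworkOnMainThreadException before any connect is attempted.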

I'm going to write a slightly more sophisticated test app and send it to the customer to try to determine if the device itself is imposing a 21s timeout, or if some router on their mobile network might be the culprit. I'll update this answer with the results.

Result: This appears to be an OS bug that affects the Samsung SPH-P100 (Galaxy Tab 1) from Sprint. I don't have access to a Tab 1 from any other carrier, so this could be blamed on Samsung or Sprint. It does not seem to generally affect Android 2.x, because I have a ZTE X501 running 2.3.6 which allows me to set longer timeouts.

Kevin Krumwiede