6

I'm using Java to stream files from Amazon S3, on Linux (Ubuntu 10) 64-bit servers.

I'm using a separate thread for each file; each thread opens an HttpURLConnection that downloads and processes its file concurrently.

Everything works beautifully until I reach a certain number of streams (usually around 200-300 concurrent streams). At irregular points after this, several (say 10) of the threads will start experiencing java.net.IOException: Connection reset errors simultaneously.

I am throttling the download speed, and am way below the 250 Mbit/s limit of an m1.large instance. There is also insignificant load everywhere else on the server (CPU, load average and memory usage are all fine).

What could be causing this, or how could I track it down?

netflux
  • One sec, let me get my magic 8-ball ;) – Brian Roach Feb 02 '12 at 17:34
  • Well either that, or any experience or advice you can offer much appreciated ;) – netflux Feb 02 '12 at 17:39
  • It's possible that some intermediate point is limiting your connections, like a company firewall perhaps. – jtahlborn Feb 02 '12 at 17:55
  • No firewall present, the server is on Amazon EC2 connecting directly to S3 – netflux Feb 02 '12 at 17:56
  • The remote server is severing the connection (or their network is freaking). It's possible someone here has hit the same problem though. – Brian Roach Feb 02 '12 at 17:56
  • Since you are on an EC2 virt talking to S3, you should send this question to AWS support at Amazon. There could be any number of reasons for your connections dropping and they may be able to help you diagnose the problem from tools they can access. You may also want to grab a tcpdump when you get to the point of a likely error and see if there is anything of interest there. – philwb Feb 07 '12 at 14:15
  • [Previous issue](http://stackoverflow.com/questions/585599/whats-causing-my-java-net-socketexception-connection-reset) @philweb is right on about using tcpdump – wort Feb 12 '12 at 04:32
  • I'd suggest using netstat to see how many TCP connections you have and what state they're in. You should probably try to maximize keep-alive cache utilization. – Samuel Edwin Ward Feb 13 '12 at 18:59

5 Answers

4

It's not trivial to guess what may be happening, but here are a couple of hints; maybe some apply to your context:

  • Check your shell (bash, zsh, or any other) to see whether you have raised the standard limits restricting the number of file descriptors (and sockets too); see man ulimit for the bash shell.
  • Did you close the streams explicitly in your Java code? Not closing streams can induce subtle problems like this (see the sketch after this list).
  • Try searching for Linux TCP kernel tuning to check whether your Ubuntu server's network stack is well suited to this kind of load.
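
For the second point, here is a minimal sketch (not from the original answer; the class name, URL handling and buffer size are placeholder assumptions) of reading from an HttpURLConnection and explicitly closing the stream in a finally block:

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class S3StreamWorker implements Runnable {
    private final String fileUrl; // hypothetical S3 object URL

    public S3StreamWorker(String fileUrl) {
        this.fileUrl = fileUrl;
    }

    public void run() {
        InputStream in = null;
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(fileUrl).openConnection();
            in = conn.getInputStream();
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                // process the downloaded chunk here
            }
        } catch (IOException e) {
            // log and handle the failure for this file
        } finally {
            // always close the stream so the socket is released (or returned to the keep-alive cache)
            if (in != null) {
                try { in.close(); } catch (IOException ignored) { }
            }
        }
    }
}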

HTH Jerome

romje
2

They might have a spillover problem at their VIPs because the number of concurrent connections has reached a limit. You could decrease the number of concurrent connections and see whether that helps...
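
To try that, here is a minimal sketch (mine, not the answerer's; the class name, URLs and pool size are placeholders to tune) of capping the number of simultaneous downloads with a fixed thread pool instead of one unbounded thread per file:

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BoundedDownloader {
    public static void main(String[] args) {
        // Placeholder URLs; in practice this would be your list of S3 objects.
        List<String> fileUrls = Arrays.asList(
                "https://example-bucket.s3.amazonaws.com/file1",
                "https://example-bucket.s3.amazonaws.com/file2");

        // Cap the number of simultaneous downloads; 100 is a placeholder,
        // to be lowered until the connection resets stop.
        ExecutorService pool = Executors.newFixedThreadPool(100);

        for (final String fileUrl : fileUrls) {
            pool.submit(new Runnable() {
                public void run() {
                    // open the HttpURLConnection for fileUrl and process it here
                }
            });
        }
        pool.shutdown();
    }
}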

yadab
0

The problem here is largely in your language. The high load is triggering the error condition, and the error condition results in the exception. Not the other way around.

user207421
  • Can you give any more details as to why you think this is? As I said I am well within network, CPU and memory bounds. – netflux Feb 06 '12 at 12:48
  • @robw Exceptions don't 'trigger' things. They are triggered by error conditions. The error condition here is clearly the overload. – user207421 Feb 07 '12 at 09:03
  • Perhaps you misunderstand me. I'm quite aware of what exceptions are and how they work. You could well also be right that load is triggering this, but the answer isn't simply "download less". An m1.large should be able to handle this much workload. I want to know how to understand the exact cause of the disconnections, and fix them. – netflux Feb 07 '12 at 13:24
  • @robw I didn't say anything about 'download less', but according to your own post it is definitely load-related. Possibly you just have too many connections for the OS. – user207421 Feb 08 '12 at 09:17
0

One relatively common reason for problems like this is that an intermediate proxy (firewall, load balancer) drops what it deems an inactive (or too long-lived) HTTP connection. But beyond that general possibility, EC2 definitely has more kinks, as others have suggested.
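
If an intermediary is dropping idle connections, explicit timeouts at least make those drops fail fast so they can be retried. A minimal sketch (mine, not the answerer's; the class name and timeout values are assumptions to tune):

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class TimeoutExample {
    // Sketch only: the timeout values are placeholders, not recommendations.
    static HttpURLConnection openWithTimeouts(String fileUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(fileUrl).openConnection();
        conn.setConnectTimeout(15 * 1000); // fail fast if the connection cannot be established
        conn.setReadTimeout(60 * 1000);    // fail fast if an intermediary silently drops the stream
        return conn;
    }
}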

StaxMan
0

You are probably running out of ephemeral ports. This happens under load when many short-lived connections are opened and closed rapidly. The standard Java HttpURLConnection is not going to give you the flexibility you need to set the proper socket options. I recommend going with the Apache HttpComponents project, and setting options like so...

...
HttpGet httpGet = new HttpGet(uri);
HttpParams params = new BasicHttpParams();
params.setParameter(CoreConnectionPNames.CONNECTION_TIMEOUT, 16 * 1000); // 16 seconds
params.setParameter(CoreConnectionPNames.SO_REUSEADDR, true); // <-- teh MOJO!

DefaultHttpClient httpClient = new DefaultHttpClient(connectionManager, params);
BasicHttpContext httpContext = new BasicHttpContext();
HttpResponse httpResponse = httpClient.execute(httpGet, httpContext);

StatusLine statusLine = httpResponse.getStatusLine();
if (statusLine.getStatusCode() >= HTTP_STATUS_CODE_300)
{
...

I've omitted some code, like the connectionManager setup, but you can grok that from their docs.
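
As a rough guide, the omitted connection manager setup might look something like this sketch (my assumption of a typical HttpClient 4.1 setup using ThreadSafeClientConnManager, not the answerer's actual code; the pool limits are placeholders to tune):

import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager;
import org.apache.http.params.BasicHttpParams;
import org.apache.http.params.HttpParams;

public class ClientFactory {
    public static DefaultHttpClient createClient() {
        ThreadSafeClientConnManager connectionManager = new ThreadSafeClientConnManager();
        connectionManager.setMaxTotal(200);           // cap on total open connections (placeholder)
        connectionManager.setDefaultMaxPerRoute(200); // S3 is effectively one route, so match the total

        HttpParams params = new BasicHttpParams();
        // set CONNECTION_TIMEOUT, SO_REUSEADDR, etc. on params as shown above
        return new DefaultHttpClient(connectionManager, params);
    }
}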

[Update] You might also add params.setParameter(CoreConnectionPNames.SO_LINGER, 1); to keep ephemeral ports from lingering around before reclamation.

brettw