2

I have a SwingWorker that gets a list of URLs (can be several hundred). In the doInBackground() method it loops through these URLs, creates each time a CloseableHttpClient, managed by a PoolingHttpClientConnectionManager, and a HttpGet. The client will then execute the httpGet and write the content (image) to a file (if possible! Not all URLs are valid and some return a 404).

This works for approximately 100 requests just until the maxTotal (or the defaultMaxPerRoute) of the connectionManager is reached. Then everything stops, the client's execution stops and no Exception gets thrown.

So I figured, I'd set the maxTotal and the defaultMaxPerRoute to 1000 and check it out. It works when I try downloading 1500 images, but it just feels so wrong! I want to reuse the clients instead of having 1000 of them in a pool.

Socket- or ConnectionTimeouts didn't work, Debugging does not tell me what is happening and creating new HttpClients without the PoolingHttpClientConnectionManager doesn't work either. Closing the clients and/or the results does not work either.

How should I manage the clients or setup the pool to make sure that my SwingWorker could even download thousands of images?

I'll try to break down my SwingWorker code to the important parts: (Almost forgot, it's actually a list of objects that's looped and each object has 3 URLs)

// this is how I init the connectionManager (outside the swing worker)
this.connectionManager = new PoolingHttpClientConnectionManager();
this.connectionManager.setMaxTotal(100);
this.connectionManager.setDefaultMaxPerRoute(100);

// this is where the images are downloaded in the swing worker
@Override
protected Void doInBackground() throws Exception{
    for(MyUrls myUrls : myUrlsList){
        client = HttpClients.custom().setConnectionManager(connectionManager).build();         
        for(MyImage image : myUrls.getImageList()){
            File outputFile = null;
            HttpEntity entity = null;
            switch(image.getImageSize()){
                case 1:
                    HttpGet httpGet = new HttpGet(image.getUrl()));
                    httpGet.setConfig(RequestConfig.custom().setSocketTimeout(1000).setConnectTimeout(1000).build());  // doesn't change anything
                    response = client.execute(httpGet);
                    if(response.getStatusLine().getStatusCode() >= 200 && response.getStatusLine().getStatusCode() < 300){
                        entity = response.getEntity();
                        if(entity != null){
                            String contentType = entity.getContentType().getValue();
                            String extension = "." + (contentType.contains("/") ? contentType.substring(contentType.indexOf("/") + 1) : "jpg");
                            outputFile = new File(image.getName()+extension);
                        }
                    }else{
                        System.err.println("download of "+image.getName()+extension+" failed: " + response.getStatusLine().getStatusCode());
                    }
                    break;
                case 2:
                   // the other cases are pretty much the same
            }
            if(entity != null && outputFile != null){
                try(InputStream inputStream = entity.getContent(); FileOutputStream outputStream = new FileOutputStream(outputFile)){
                    byte[] buffer = new byte[1024];
                    int bytesRead;
                    while((bytesRead = inputStream.read(buffer)) != -1){
                        outputStream.write(buffer, 0, bytesRead);
                    }
                }
            }
        }
    }
}
GameDroids
  • 5,584
  • 6
  • 40
  • 59

1 Answers1

8

You may be leaking connections due to the fact that you only properly dispose of them when you have OK responses (in the range 200 to 300) that you can dump to your file output stream properly.

Long story short : always dispose of the entity

Disposing of the entity (the "content" of the resposne) is the only way to free the connexion object. To that end, you can call EntityUtils.consume(responseEntity)) and/or close the response object (call response.close()) and/or the close or read up to its end the entity's stream (call stream.close()) in a finally clause for each request/response cycle.

Your code actually consumes the response's stream, but only for certain cases (mainly: when the request succeeds), so that is a proper use of the API, but not otherwise (e.g. 404 responses), that you never consume nor release.

You can also use the ResponseHandler API variant for that, and it will hande ALL of this for you - i'd advise you to use it.

If you do not always do that, the connection will be considered active by HTTPClient, and it will never be released, nor reused. At some point you will end up with either an empty connection pool, or a system error (e.g. too many files open).

Please note that it's not only for 200 OK response, it's for every call (even a 404 usually has a response body, if you do not consume that body, then the connection will not be freed).

Hints from the user guide

I refer you to the "fundamentals" part of the user guide : https://hc.apache.org/httpcomponents-client-ga/tutorial/html/fundamentals.html

Here is an example of request execution process in its simplest form:

CloseableHttpClient httpclient = HttpClients.createDefault();
HttpGet httpget = new HttpGet("http://localhost/");
CloseableHttpResponse response = httpclient.execute(httpget);
try {
    <...>
} finally {
    response.close();
}

The very important part is the close() in the finally close.

1.1.5. Ensuring release of low level resources

In order to ensure proper release of system resources one must close either the content stream associated with the entity or the response itself

And :

1.1.6. Consuming entity content

The recommended way to consume the content of an entity is by using its HttpEntity#getContent() or HttpEntity#writeTo(OutputStream) methods. HttpClient also comes with the EntityUtils class, which exposes several static methods to more easily read the content or information from an entity. Instead of reading the java.io.InputStream directly, one can retrieve the whole content body in a string / byte array by using the methods from this class. However, the use of EntityUtils is strongly discouraged unless the response entities originate from a trusted HTTP server and are known to be of limited length.

About the response handler variant :

1.1.8. Response handlers

The simplest and the most convenient way to handle responses is by using the ResponseHandler interface, which includes the handleResponse(HttpResponse response) method. This method completely relieves the user from having to worry about connection management.

Hints from stack overflow

You can also read this : Why did the author use EntityUtils.consume(httpEntity);?

Sidenote

The point of having multithreaded connection managers is that you can have ONE (and only one) client, and share it for your whole application. You do not need to build one client per request and have them share a single connection manager. Although you can.
Having a single client could help if you use caching, cookies, authentication, ... thant can not work when creating a new client for each request.

1.2.1. HttpClient thread safety

HttpClient implementations are expected to be thread safe. It is recommended that the same instance of this class is reused for multiple request executions.

Community
  • 1
  • 1
GPI
  • 9,088
  • 2
  • 31
  • 38
  • I don't know why I didn't use `ResponseHandler` right away. I use it in other parts of the application, but not here. I just tried `EntityUtils,consume()` and `response.close()` and it got me a little further – but not all the way, so I'll try it with the `ReponseHandler` now. – GameDroids Oct 19 '16 at 15:05
  • Thank you for this detailed answer and your sidenotes. You are right, implementing my own `ResponseHandler` did the job! Just downloaded 7.500 Images without a problem. – GameDroids Oct 19 '16 at 16:12
  • Do you find you still need to ensure EntityUtils.consume sometimes with try with resource approach? – ledlogic Apr 11 '23 at 07:09
  • I find that using the API the way it is documented is the only way to go :-). Quoting back " 1.1.5. Ensuring release of low level resources: to ensure proper release of system resources one must close either the content stream associated with the entity or the response itself". Now there are a few ways to do that, exposed here or in the documentation. In the above paragraph, one sees that try with resources on the `response` object is one of the approved ways. – GPI Apr 11 '23 at 12:53