
I'm trying to find the most efficient way to test 300,000+ URLs in a database, essentially checking whether each URL is still valid. Having looked around the site, I've found many excellent answers and am now using something along the lines of:

Read URL from file.... Test URL:

        final URL url = new URL("http://" + address);
        final HttpURLConnection urlConn = (HttpURLConnection) url.openConnection();
        urlConn.setConnectTimeout(1000 * 10);
        urlConn.connect();
        urlConn.getResponseCode(); // Do something with the code
        urlConn.disconnect();

Write details back to file....

So a few questions:

1) Is there a more efficient way to test URLs and get response codes?

2) Initially I am able to test about 50 URLs per minute, but after five or so minutes things really slow down - I imagine there are some resources I'm not releasing, but I'm not sure what.

3) Certain URLs (e.g. www.bhs.org.au) cause the above to hang for minutes (not good when I have so many URLs to test), even with the connect timeout set. Is there any way I can tighten this up?

Thanks in advance for any help; it's been quite a few years since I've written any code and I'm starting again from scratch :-)

Phil
  • http://stackoverflow.com/a/272918/166390 - as far as mitigating blocking or increasing throughput goes, see threads or other forms of distribution –  Mar 05 '13 at 18:21
  • I don't know Java well, but in order to avoid (3), I'd suggest making connections from multiple threads. I'd create a thread pool and keep maybe 20-50 connections going at any one time. This way, timeouts won't clog the system. – Gort the Robot Mar 05 '13 at 18:24

2 Answers


This may or may not help, but you might want to change your request method to HEAD instead of using the default, which is GET:

urlConn.setRequestMethod("HEAD");

This tells the server that you only need the status line and headers back, not the response body.

The article "What Is a HTTP HEAD Request Good for" describes some uses for HEAD, including link verification:

[Head] asks for the response identical to the one that would correspond to a GET request, but without the response body. This is useful for retrieving meta-information written in response headers, without having to transport the entire content.... This can be used for example for creating a faster link verification service.
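Putting that together with the question's snippet, a full check might look something like the sketch below. The LinkChecker class and checkUrl helper are illustrative names, and the read timeout is an addition that is not in the question's original code:

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public final class LinkChecker {
        // Illustrative helper: returns the HTTP status code for an address.
        static int checkUrl(String address) throws IOException {
            final URL url = new URL("http://" + address);
            final HttpURLConnection urlConn = (HttpURLConnection) url.openConnection();
            urlConn.setRequestMethod("HEAD");     // headers only, no response body
            urlConn.setConnectTimeout(10 * 1000); // bound the time to establish the connection
            urlConn.setReadTimeout(10 * 1000);    // also bound the wait for the response itself
            try {
                return urlConn.getResponseCode(); // connects implicitly if needed
            } finally {
                urlConn.disconnect();
            }
        }
    }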

chue x
  • The link is dead. Here is [another one](https://ochronus.com/http-head-request-good-uses/) with the same title; I don't know whether the content is the same. – Alexander Malakhov Jul 15 '14 at 10:25

By far the fastest way to do this would be to use java.nio to open a regular TCP connection to your target host on port 80. Then, simply send it a minimal HTTP request and process the result yourself.

The main advantage of this is that you can have a pool of 10 or 100 or even 1000 connections open and loading at the same time rather than having to do them one after the other. With this, for example, it won't matter much if one server (www.bhs.org.au) takes several minutes to respond. It'll simply hog one of your many connections in the pool, but others will keep running.
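To make the "minimal HTTP request" idea concrete, here is a stripped-down sketch. It uses a blocking SocketChannel for brevity; the pooled approach described above would instead register many non-blocking channels with a Selector, but the request and response handling would look the same:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;
    import java.nio.charset.StandardCharsets;

    public final class RawHttpCheck {
        // Sends a minimal HEAD request over a raw TCP connection to port 80
        // and returns the status line, e.g. "HTTP/1.1 200 OK".
        // (Blocking mode for brevity; a pooled version would use a Selector.)
        static String statusLine(String host) throws IOException {
            try (SocketChannel ch = SocketChannel.open(new InetSocketAddress(host, 80))) {
                String request = "HEAD / HTTP/1.1\r\n"
                        + "Host: " + host + "\r\n"
                        + "Connection: close\r\n\r\n";
                ch.write(ByteBuffer.wrap(request.getBytes(StandardCharsets.US_ASCII)));

                ByteBuffer buf = ByteBuffer.allocate(256);
                StringBuilder line = new StringBuilder();
                read:
                while (ch.read(buf) != -1) {
                    buf.flip();
                    while (buf.hasRemaining()) {
                        char c = (char) buf.get();
                        if (c == '\r' || c == '\n') break read; // end of status line
                        line.append(c);
                    }
                    buf.clear();
                }
                return line.toString();
            }
        }
    }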

You could also achieve much the same thing, with a little more overhead but far simpler code, by using a thread pool to run many HttpURLConnections (the way you are doing them now) in parallel across multiple threads, as sketched below.
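A sketch of that thread-pool variant, assuming a checkUrl(String) helper along the lines of the HEAD example in the other answer (the pool size and input list here are illustrative):

    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public final class ParallelChecker {
        public static void main(String[] args) throws InterruptedException {
            // Illustrative input; in the question these come from a file.
            List<String> addresses = Arrays.asList("www.example.com", "www.bhs.org.au");

            // 50 checks in flight at once: one slow host ties up one worker,
            // not the whole run.
            ExecutorService pool = Executors.newFixedThreadPool(50);
            for (final String address : addresses) {
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            int code = LinkChecker.checkUrl(address);
                            System.out.println(address + " -> " + code);
                        } catch (Exception e) {
                            System.out.println(address + " -> failed: " + e.getMessage());
                        }
                    }
                });
            }
            pool.shutdown();                          // accept no new tasks
            pool.awaitTermination(1, TimeUnit.HOURS); // wait for queued checks to finish
        }
    }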

Markus A.