
I need to "read" a lot of pages with slightly different URLs for further parsing. I am using this code:

    // Imports needed at the top of the file:
    // import java.io.BufferedReader;
    // import java.io.BufferedWriter;
    // import java.io.FileWriter;
    // import java.io.InputStreamReader;
    // import java.net.URL;
    // import java.net.URLConnection;

    String charset = "UTF-8";
    System.setProperty("java.net.useSystemProxies", "false");
    URL pageToRead = new URL(http);

    // Ask the server for UTF-8 and decode the stream with the same
    // charset; the header alone does not affect how the bytes are decoded
    URLConnection yc = pageToRead.openConnection();
    yc.setRequestProperty("Accept-Charset", charset);
    BufferedReader in = new BufferedReader(new InputStreamReader(
            yc.getInputStream(), charset));

    String inputLine;
    FileWriter fstream = new FileWriter(html);
    BufferedWriter out = new BufferedWriter(fstream);

    // readLine() strips the line terminator, so write one back
    while ((inputLine = in.readLine()) != null) {
        out.write(inputLine);
        out.newLine();
    }

    in.close();
    out.close();

Note about the variables: http is a String containing the full URL, and html is a String containing the full output file name.

Two questions:

  1. How can I change this code to read URLs faster?
  2. Or maybe I am wrong and the problem is the HTTP server; maybe it just can't serve pages any faster. How can I check that?
– RiantHoff
  • Maybe you could consider using Apache's HTTP client; it would help you read pages faster if you are reading from the same server and it supports keep-alive. Other than that, you should open any stream in front of a try block and close it in a finally block, otherwise you leak file descriptors. – fge Dec 24 '12 at 16:42
  • Forgot to say: every saved HTML file is 16-20 kbytes, so they are nearly all the same size. Samuel, thanks for editing. – RiantHoff Dec 24 '12 at 16:54
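
For reference, a minimal sketch of the Apache HTTP client suggestion from the first comment, assuming HttpClient 4.3 or later (the class names below come from that version). Reusing a single client instance across requests lets its pooled connection manager keep connections to the same host alive. The urls list here is a hypothetical stand-in for the asker's set of slightly different URLs:

    import java.nio.charset.StandardCharsets;
    import java.util.List;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    void fetchAll(List<String> urls) throws Exception {
        // One client for all requests, so keep-alive connections
        // to the same host can be reused between pages
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            for (String url : urls) {
                HttpGet get = new HttpGet(url);
                try (CloseableHttpResponse response = client.execute(get)) {
                    String body = EntityUtils.toString(
                            response.getEntity(), StandardCharsets.UTF_8);
                    // save body to its output file, as in the question
                }
            }
        }
    }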

2 Answers


You can time your code and find out how long each method is taking to execute by doing something like this:

    // Record wall-clock time around the call you want to measure
    long startTime = System.nanoTime();
    methodToTime();
    long endTime = System.nanoTime();

    long duration = endTime - startTime; // elapsed time in nanoseconds

Then print it out somewhere.

Source: How do I time a method's execution in Java?
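
To address the asker's second question (whether the server itself is the bottleneck), one option is to time the connection and the body transfer separately. A minimal sketch, assuming the same http variable as in the question:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLConnection;

    long t0 = System.nanoTime();
    URLConnection yc = new URL(http).openConnection();
    BufferedReader in = new BufferedReader(
            new InputStreamReader(yc.getInputStream(), "UTF-8"));
    long t1 = System.nanoTime(); // connection + server's time to first byte

    while (in.readLine() != null) {
        // discard the body; only the transfer time matters here
    }
    long t2 = System.nanoTime(); // body transfer time
    in.close();

    System.out.printf("first byte: %d ms, transfer: %d ms%n",
            (t1 - t0) / 1000000, (t2 - t1) / 1000000);

If the first interval dominates and fluctuates between requests, the server or network latency is the limit; if the second dominates, the reading and writing code is the place to optimize.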

– Michael
  • I already did that before asking, and I saw that reading over HTTP is many times slower than saving the HTML file; that's why I asked about reading HTTP, not saving HTML. The timing shows each page takes about 0.2 seconds to read, but I don't know whether the HTTP server is just slow, rather than my method of reading pages from it. Also, I have noticed that most pages read in 0.2 ± 0.1 seconds, but one in every 5-10 reads is far slower, i.e. 1 or 2 seconds. Why is that? – RiantHoff Dec 24 '12 at 16:48
  • @RiantHoff Probably because that page is significantly bigger than the others. – Michael Dec 24 '12 at 16:51

You usually see fairly high latency when reading HTML over HTTP (proxies etc.). However, a decent web server can handle a lot of requests concurrently.

In order to get more throughput, you should execute several HTTP reads concurrently, using a thread pool, as in the sketch below.

Depending on the server, this can increase your throughput by 10 to 20 times (it also depends on whether your client OS limits the number of simultaneous connections).
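
As a minimal sketch of that idea, assuming a list of URL strings and a hypothetical fetch(url) helper that performs the single-page read from the question:

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    void fetchAll(List<String> urls) throws InterruptedException {
        // 10 workers is an arbitrary starting point; tune it against
        // the server and your OS's connection limits
        ExecutorService pool = Executors.newFixedThreadPool(10);
        for (final String url : urls) {
            pool.submit(new Runnable() {
                public void run() {
                    fetch(url); // hypothetical helper: reads one page
                }               // and saves it, as in the question
            });
        }
        pool.shutdown();                          // accept no new tasks
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for the queue to drain
    }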

– R.Moeller