
I need to "read" a lot of pages with slightly different URLs for further parsing. I am using this code:

    // Imports needed at the top of the file:
    // import java.io.BufferedReader;
    // import java.io.BufferedWriter;
    // import java.io.FileWriter;
    // import java.io.InputStreamReader;
    // import java.net.URL;
    // import java.net.URLConnection;

    String charset = "UTF-8";
    System.setProperty("java.net.useSystemProxies", "false");
    URL pageToRead = new URL(http);

    // Ask the server for UTF-8 and decode the stream with the same
    // charset; the header alone does not affect how the bytes are decoded
    URLConnection yc = pageToRead.openConnection();
    yc.setRequestProperty("Accept-Charset", charset);
    BufferedReader in = new BufferedReader(new InputStreamReader(
            yc.getInputStream(), charset));

    String inputLine;
    FileWriter fstream = new FileWriter(html);
    BufferedWriter out = new BufferedWriter(fstream);

    // readLine() strips the line terminator, so write one back
    while ((inputLine = in.readLine()) != null) {
        out.write(inputLine);
        out.newLine();
    }

    in.close();
    out.close();

Note about the variables: http is a String containing the full URL, and html is a String containing the full output file name.

Two questions:

  1. How can I change this code to read URLs faster?
  2. Or maybe I am wrong and the problem is the HTTP server; maybe it just can't serve pages any faster. How can I check that?
– RiantHoff
  • Maybe you could consider using Apache's HTTP client; it would help you read pages faster if you are reading from the same server and it supports keep-alive. Other than that, you should open any stream in front of a try block and close it in a finally block, otherwise you leak file descriptors. – fge Dec 24 '12 at 16:42
  • Forgot to say: every saved HTML file is 16-20 kbytes, so they are nearly all the same size. Samuel, thanks for editing. – RiantHoff Dec 24 '12 at 16:54
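
For reference, a minimal sketch of the Apache HTTP client suggestion from the first comment, assuming HttpClient 4.3 or later (the class names below come from that version). Reusing a single client instance across requests lets its pooled connection manager keep connections to the same host alive. The urls list here is a hypothetical stand-in for the asker's set of slightly different URLs:

    import java.nio.charset.StandardCharsets;
    import java.util.List;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    void fetchAll(List<String> urls) throws Exception {
        // One client for all requests, so keep-alive connections
        // to the same host can be reused between pages
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            for (String url : urls) {
                HttpGet get = new HttpGet(url);
                try (CloseableHttpResponse response = client.execute(get)) {
                    String body = EntityUtils.toString(
                            response.getEntity(), StandardCharsets.UTF_8);
                    // save body to its output file, as in the question
                }
            }
        }
    }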

2 Answers


You can time your code and find out how long each method is taking to execute by doing something like this:

    // Record wall-clock time around the call you want to measure
    long startTime = System.nanoTime();
    methodToTime();
    long endTime = System.nanoTime();

    long duration = endTime - startTime; // elapsed time in nanoseconds

Then print it out somewhere.

Source: How do I time a method's execution in Java?
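
To address the asker's second question (whether the server itself is the bottleneck), one option is to time the connection and the body transfer separately. A minimal sketch, assuming the same http variable as in the question:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLConnection;

    long t0 = System.nanoTime();
    URLConnection yc = new URL(http).openConnection();
    BufferedReader in = new BufferedReader(
            new InputStreamReader(yc.getInputStream(), "UTF-8"));
    long t1 = System.nanoTime(); // connection + server's time to first byte

    while (in.readLine() != null) {
        // discard the body; only the transfer time matters here
    }
    long t2 = System.nanoTime(); // body transfer time
    in.close();

    System.out.printf("first byte: %d ms, transfer: %d ms%n",
            (t1 - t0) / 1000000, (t2 - t1) / 1000000);

If the first interval dominates and fluctuates between requests, the server or network latency is the limit; if the second dominates, the reading and writing code is the place to optimize.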

– Michael
  • I already did that before asking, and I saw that reading over HTTP is many times slower than saving the HTML file; that's why I asked about reading HTTP, not saving HTML. The timing shows each page takes about 0.2 seconds to read, but I don't know whether the HTTP server is just slow, rather than my method of reading pages from it. Also, I have noticed that most pages read in 0.2 ± 0.1 seconds, but one in every 5-10 reads is far slower, i.e. 1 or 2 seconds. Why is that? – RiantHoff Dec 24 '12 at 16:48
  • @RiantHoff Probably because that page is significantly bigger than the others. – Michael Dec 24 '12 at 16:51

You usually see fairly high latency when reading HTML over HTTP (proxies etc.). However, a decent web server can handle a lot of requests concurrently.

In order to get more throughput, you should execute several HTTP reads concurrently, using a thread pool, as in the sketch below.

Depending on the server, this can increase your throughput by 10 to 20 times (it also depends on whether your client OS limits the number of simultaneous connections).
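
As a minimal sketch of that idea, assuming a list of URL strings and a hypothetical fetch(url) helper that performs the single-page read from the question:

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    void fetchAll(List<String> urls) throws InterruptedException {
        // 10 workers is an arbitrary starting point; tune it against
        // the server and your OS's connection limits
        ExecutorService pool = Executors.newFixedThreadPool(10);
        for (final String url : urls) {
            pool.submit(new Runnable() {
                public void run() {
                    fetch(url); // hypothetical helper: reads one page
                }               // and saves it, as in the question
            });
        }
        pool.shutdown();                          // accept no new tasks
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for the queue to drain
    }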

– R.Moeller