
So I'm working on web scraping for a certain website. The problem is:

Given a set of URLs (on the order of hundreds to thousands), I would like to retrieve the HTML of each URL in an efficient manner, especially time-wise. I need to be able to do thousands of requests every 5 minutes.

This usually implies using a pool of threads that make requests, drawing from a set of not-yet-requested URLs. But before jumping into implementing this, I believe it's worth asking here, since this seems to be a fairly common problem when doing web scraping or web crawling.
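
Roughly what I have in mind is something like the sketch below (plain java.util.concurrent, no extra library; the class name, URLs, and pool size are just placeholders):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.*;
    import java.util.concurrent.*;

    public class FetchPool {
      public static void main(String[] args) throws Exception {
        List<String> urls = Arrays.asList("http://example.com/a", "http://example.com/b");
        ExecutorService pool = Executors.newFixedThreadPool(20); // pool size is a guess

        List<Callable<String>> tasks = new ArrayList<Callable<String>>();
        for (final String url : urls) {
          tasks.add(new Callable<String>() {
            @Override
            public String call() throws Exception {
              // Plain blocking fetch; each task occupies one pool thread while it runs.
              StringBuilder html = new StringBuilder();
              BufferedReader in = new BufferedReader(new InputStreamReader(
                  new URL(url).openStream(), StandardCharsets.UTF_8));
              try {
                String line;
                while ((line = in.readLine()) != null) {
                  html.append(line).append('\n');
                }
              } finally {
                in.close();
              }
              return html.toString();
            }
          });
        }

        // Blocks until every URL has been fetched (or has failed).
        for (Future<String> page : pool.invokeAll(tasks)) {
          System.out.println(page.get().length() + " chars");
        }
        pool.shutdown();
      }
    }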

Is there any library that has what I need?

Juan Andrés Diana

4 Answers


So I'm working on web scraping for a certain website.

Are you scraping a single server, or are you scraping from multiple other hosts? If it is the former, the server you are scraping may not like too many concurrent connections from a single IP.

If it is the latter, this is really a general question of how many outbound connections you should open from one machine. There is a physical limit, but it is pretty large. Practically, it depends on where the client is deployed: the better the connectivity, the more connections it can accommodate.

You might want to look at the source code of a good download manager to see if they have a limit on the number of outbound connections.

Definitely use asynchronous I/O, but you would still do well to limit the number of concurrent connections.
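
As a rough illustration of what I mean by limiting the number, a Semaphore around each request is enough even with plain blocking I/O (the class name is made up and the cap of 10 is arbitrary; tune it to what the target tolerates):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.concurrent.Semaphore;

    public class BoundedFetch {
      // Arbitrary cap on simultaneous outbound connections.
      private static final Semaphore PERMITS = new Semaphore(10);

      static String fetch(String url) throws Exception {
        PERMITS.acquire();                 // block while 10 requests are already in flight
        try {
          BufferedReader in = new BufferedReader(
              new InputStreamReader(new URL(url).openStream(), "UTF-8"));
          try {
            StringBuilder html = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
              html.append(line).append('\n');
            }
            return html.toString();
          } finally {
            in.close();
          }
        } finally {
          PERMITS.release();               // free the slot for the next request
        }
      }

      public static void main(String[] args) throws Exception {
        System.out.println(fetch("http://example.com/").length() + " chars");
      }
    }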

Miserable Variable
  • I'm not sure how many servers are behind it, but it's only one website. The website definitely handles many connections from different IPs; I'm not sure whether they have per-IP restrictions, I will have to find out. – Juan Andrés Diana Jan 17 '13 at 02:21
  • It may not be one, but they'll probably not like a hundred concurrent connections, especially if this is not a business interest for them. – Miserable Variable Jan 17 '13 at 02:24

I would start by researching asynchronous communication. Then take a look at Netty.

Keep in mind there is always a limit to how fast one can load a web page. For an average home connection, it will be around a second. Take this into consideration when programming your application.
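
For orientation only, an untested Netty 4-style sketch of a single asynchronous GET might look roughly like this (class and header names differ between Netty versions, so treat it as a starting point rather than working code):

    import io.netty.bootstrap.Bootstrap;
    import io.netty.channel.*;
    import io.netty.channel.nio.NioEventLoopGroup;
    import io.netty.channel.socket.SocketChannel;
    import io.netty.channel.socket.nio.NioSocketChannel;
    import io.netty.handler.codec.http.*;
    import io.netty.util.CharsetUtil;

    public class NettyGet {
      public static void main(String[] args) throws Exception {
        final String host = "example.com";
        EventLoopGroup group = new NioEventLoopGroup();
        try {
          Bootstrap b = new Bootstrap()
              .group(group)
              .channel(NioSocketChannel.class)
              .handler(new ChannelInitializer<SocketChannel>() {
                @Override
                protected void initChannel(SocketChannel ch) {
                  ch.pipeline().addLast(
                      new HttpClientCodec(),
                      new HttpObjectAggregator(1048576),  // aggregate the response into one object
                      new SimpleChannelInboundHandler<FullHttpResponse>() {
                        @Override
                        protected void channelRead0(ChannelHandlerContext ctx, FullHttpResponse resp) {
                          System.out.println(resp.content().toString(CharsetUtil.UTF_8).length()
                              + " chars of HTML");
                          ctx.close();                    // done with this page
                        }
                      });
                }
              });

          Channel ch = b.connect(host, 80).sync().channel();
          FullHttpRequest req =
              new DefaultFullHttpRequest(HttpVersion.HTTP_1_1, HttpMethod.GET, "/");
          req.headers().set(HttpHeaders.Names.HOST, host);
          ch.writeAndFlush(req);
          ch.closeFuture().sync();                        // wait until the handler closes the channel
        } finally {
          group.shutdownGracefully();
        }
      }
    }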

syb0rg
  • My idea is to do a DNS lookup, then make many simultaneous requests and fetch only the HTML files. That shouldn't take more than 500 ms, I think, provided the network connectivity is good enough. I've been looking at Netty, but I'm not sure how it could help me with this. – Juan Andrés Diana Jan 17 '13 at 02:35
  • Agreed, but remember that the time adds up quickly. With, say, 1000 URLs, it could take around 10 minutes to retrieve and process all the files. – syb0rg Jan 17 '13 at 02:38

Your bandwidth utilization will be the sum of all of the HTML documents that you retrieve (plus a little overhead) no matter how you slice it (though some web servers may support compressed HTTP streams, so certainly use a client capable of accepting them).
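
For example, with a plain HttpURLConnection you can advertise gzip support and decompress only when the server actually compresses (the URL is a placeholder):

    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.zip.GZIPInputStream;

    public class GzipFetch {
      public static void main(String[] args) throws Exception {
        HttpURLConnection conn =
            (HttpURLConnection) new URL("http://example.com/").openConnection();
        conn.setRequestProperty("Accept-Encoding", "gzip");   // tell the server we accept compression
        InputStream raw = conn.getInputStream();
        InputStream body = "gzip".equalsIgnoreCase(conn.getContentEncoding())
            ? new GZIPInputStream(raw)                        // decompress only if it was compressed
            : raw;
        BufferedReader in = new BufferedReader(new InputStreamReader(body, "UTF-8"));
        try {
          StringBuilder html = new StringBuilder();
          String line;
          while ((line = in.readLine()) != null) {
            html.append(line).append('\n');
          }
          System.out.println(html.length() + " chars of HTML");
        } finally {
          in.close();
        }
      }
    }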

The optimal number of concurrent threads depends a great deal on your network connectivity to the sites in question. Only experimentation can find an optimal number. You can certainly use one set of threads for retrieving HTML documents and a separate set of threads to process them to make it easier to find the right balance.
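
A minimal sketch of that two-pool arrangement, with illustrative pool sizes and stubbed-out download/parse methods, might look like this:

    import java.util.concurrent.*;

    public class FetchAndProcess {
      public static void main(String[] args) throws Exception {
        // Sizes are illustrative; only experimentation finds the right balance.
        final ExecutorService fetchers = Executors.newFixedThreadPool(20);   // I/O-bound work
        final ExecutorService processors = Executors.newFixedThreadPool(4);  // CPU-bound work

        String[] urls = {"http://example.com/a", "http://example.com/b"};
        for (final String url : urls) {
          fetchers.submit(new Runnable() {
            @Override
            public void run() {
              final String html = download(url);       // network work on the fetch pool
              processors.submit(new Runnable() {
                @Override
                public void run() {
                  parse(url, html);                    // parsing work on the processing pool
                }
              });
            }
          });
        }

        fetchers.shutdown();
        fetchers.awaitTermination(1, TimeUnit.HOURS);  // every page handed off to the processors
        processors.shutdown();
      }

      static String download(String url) { /* HTTP GET, as in the other snippets */ return ""; }

      static void parse(String url, String html) { /* extract whatever you need */ }
    }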

I'm a big fan of HTML Agility Pack for web scraping in the .NET world, but I cannot make a specific recommendation for Java. The following question may be of use in finding a good, Java-based scraping platform:

Web scraping with Java

Eric J.

Jsoup (http://www.jsoup.org) covers just the scraping part! The thread pooling, I think, you should implement yourself.
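
For the scraping part, a jsoup call is about as short as it gets; run calls like this from your own thread pool (the URL and selector are just examples):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class JsoupExample {
      public static void main(String[] args) throws Exception {
        // Fetch and parse one page in a single call.
        Document doc = Jsoup.connect("http://example.com/").get();
        System.out.println(doc.title());
        for (Element link : doc.select("a[href]")) {   // CSS-style selector
          System.out.println(link.attr("abs:href"));   // resolve relative links
        }
      }
    }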

Update

If this approach fits your needs, you can download the complete class files here: http://codetoearn.blogspot.com/2013/01/concurrent-web-requests-with-thread.html

    AsyncWebReader webReader = new AsyncWebReader(5/*number of threads*/, new String[]{
              "http://www.google.com",
              "http://www.yahoo.com",
              "http://www.live.com",
              "http://www.wikipedia.com",
              "http://www.facebook.com",
              "http://www.khorasannews.com",
              "http://www.fcbarcelona.com",
              "http://www.khorasannews.com",
            });

    webReader.addObserver(new Observer() {
      @Override
      public void update(Observable o, Object arg) {
        if (arg instanceof Exception) {
          Exception ex = (Exception) arg;
          System.out.println(ex.getMessage());
        } /*else if (arg instanceof List) {
          List vals = (List) arg;
          System.out.println(vals.get(0) + ": " + vals.get(1));
        } */else if (arg instanceof Object[]) {
          Object[] objects = (Object[]) arg;
          HashMap result = (HashMap) objects[0];
          String[] success = (String[]) objects[1];
          String[] fail = (String[]) objects[2];

          System.out.println("Failds");
          for (int i = 0; i < fail.length; i++) {
            String string = fail[i];
            System.out.println(string);
          }

          System.out.println("-----------");
          System.out.println("success");
          for (int i = 0; i < success.length; i++) {
            String string = success[i];
            System.out.println(string);
          }

          System.out.println("\n\nresult of Google: ");
          System.out.println(result.remove("http://www.google.com"));
        }
      }
    });
    Thread t = new Thread(webReader);
    t.start();
    t.join();
ehsun7b