
I need to write a web scraper that scrapes roughly 1M websites and saves their title, description and keywords into one big file (containing the scraped URL and the related words). The URLs should be extracted from a big file.

I ran Crawler4j on the 1M-URL file and started the crawler with controller.start(MyCrawler.class, 20). 20 is an arbitrary number. Each crawler passes the extracted words into a blocking queue, and a single thread takes them from the queue and writes the words and the URL to the file. I used a single writer thread so that I don't have to synchronize on the file. I set the crawl depth to 0 (I only need to crawl my seed list).
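
The writer itself is not shown below; here is a minimal sketch of what such a FileWriterThread might look like (the class body, parameter types and the way the output path is built are all assumptions, simplified for illustration):

// Hypothetical sketch of the single writer thread: it drains the queue that the
// crawler threads fill and appends each record to the one big output file.
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.concurrent.BlockingQueue;

public class FileWriterThread implements Runnable {
    private final BlockingQueue<String> urlsQueue;
    private final String crawlStorageFolder;
    private final String outputFile;

    public FileWriterThread(BlockingQueue<String> urlsQueue, String crawlStorageFolder, String outputFile) {
        this.urlsQueue = urlsQueue;
        this.crawlStorageFolder = crawlStorageFolder;
        this.outputFile = outputFile;
    }

    @Override
    public void run() {
        // Only this thread touches the file, so no synchronization on the file is needed.
        File target = new File(crawlStorageFolder, outputFile); // assumption: output lives in the storage folder
        try (BufferedWriter writer = new BufferedWriter(new FileWriter(target, true))) {
            while (!Thread.currentThread().isInterrupted()) {
                String record = urlsQueue.take(); // blocks until a crawler thread offers a record
                writer.write(record);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}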

After running this overnight I had only downloaded around 200K URLs. I'm running the scraper on a single machine with a wired connection. Since most of the URLs are on different hosts, I don't think the politeness delay matters much here.

EDIT

I tried starting Crawler4j with the non-blocking start, but it just got blocked. My Crawler4j version is 4.2. This is the code I'm using:

CrawlConfig config = new CrawlConfig();
List<Header> headers = Arrays.asList(
        new BasicHeader("Accept", "text/html,text/xml"),
        new BasicHeader("Accept-Language", "en-gb, en-us, en-uk")
);
config.setDefaultHeaders(headers);
config.setCrawlStorageFolder(crawlStorageFolder);
config.setMaxDepthOfCrawling(0);
config.setUserAgentString("testcrawl");
config.setIncludeBinaryContentInCrawling(false);
config.setPolitenessDelay(10);

PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

// Bounded hand-off queue between the crawler threads and the single writer thread
BlockingQueue<String> urlsQueue = new ArrayBlockingQueue<>(400);
controller = new CrawlController(config, pageFetcher, robotstxtServer);

// Single writer thread drains the queue and appends the records to the output file
ExecutorService executorService = Executors.newSingleThreadExecutor();
Runnable writerThread = new FileWriterThread(urlsQueue, crawlStorageFolder, outputFile);

executorService.execute(writerThread);

// Start the crawler threads without blocking, then feed the seeds from the URL file
controller.startNonBlocking(() -> {
    return new MyCrawler(urlsQueue);
}, 4);

File file = new File(urlsFileName);
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
    String url;
    while ((url = br.readLine()) != null) {
        controller.addSeed(url);
    }
}

EDIT 1 - This is the code for MyCrawler

public class MyCrawler extends WebCrawler {
    private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg|png|mp3|zip|gz))$");
    public static final String DELIMETER = "||||";
    private final StringBuilder buffer = new StringBuilder();
    private final BlockingQueue<String> urlsQueue;

    public MyCrawler(BlockingQueue<String> urlsQueue) {
        this.urlsQueue = urlsQueue;
    }

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches();
    }

    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData parseData = (HtmlParseData) page.getParseData();
            String html = parseData.getHtml();
            String title = parseData.getTitle();

            Document document = Jsoup.parse(html);
            buffer.append(url.replaceAll("[\n\r]", "")).append(DELIMETER).append(title);
            Elements descriptions = document.select("meta[name=description]");
            for (Element description : descriptions) {
                if (description.hasAttr("content"))
                    buffer.append(description.attr("content").replaceAll("[\n\r]", ""));
            }

            Elements elements = document.select("meta[name=keywords]");
            for (Element element : elements) {
                String keywords = element.attr("content").replaceAll("[\n\r]", "");
                buffer.append(keywords);
            }
            buffer.append("\n");
            String urlContent = buffer.toString();
            buffer.setLength(0);
            urlsQueue.add(urlContent);
        }
    }

    private boolean isSuccessful(int statusCode) {
        return 200 <= statusCode && statusCode < 400;
    }
}

And so I have 2 questions:

  1. Can someone suggest any other way to make this process take less time? Maybe by tuning the number of crawler threads, or some other optimization? I'd prefer a solution that doesn't require several machines, but if you think that's the only way to go, could someone suggest how to do that? Maybe a code example?
  2. Is there any way to make the crawler start working on some URLs and keep adding more URLs during the crawl? I've looked at crawler.startNonBlocking, but it doesn't seem to work very well.

Thanks in advance

Gideon
  • Looks like you are adding all 1M URLs to the seed list and then starting the crawl. I would suggest loading the URLs incrementally and using startNonBlocking to start the crawling process. You could also reduce the number of crawler threads from 20. I hope you are setting the crawl depth as per your requirement. – Ramachandra Reddy Feb 16 '16 at 05:27
  • I did try to start the crawler as non-blocking first and then add more URLs, but it seems the crawler doesn't respond to URLs added after it has started; it just does nothing. Is there any way to calculate the optimum number of crawler threads? And yes, I set the crawl depth to 0 (added to the question) – Gideon Feb 16 '16 at 05:50
  • The optimum number of threads depends on the server configuration (number of cores, memory, etc.). Have you put a controller.waitUntilFinish statement after the call to startNonBlocking? If not, can you add it and give it a try? – Ramachandra Reddy Feb 16 '16 at 06:22
  • I came across the sample below: https://github.com/yasserg/crawler4j/blob/master/src/test/java/edu/uci/ics/crawler4j/examples/multiple/MultipleCrawlerController.java – Ramachandra Reddy Feb 16 '16 at 06:25
  • Yup, I've seen it too, but it doesn't work. It seems like the crawler is not responding to new URLs. What I did is startNonBlocking on the crawler, then start adding URLs, and then waitUntilFinish, but it just hangs there. Regarding the number of threads - is there any equation for the optimal number? Looking at the WebCrawler source code it seems like it just uses the Apache HTTP client in a blocking manner; do you think it's possible to make it asynchronous somehow? – Gideon Feb 16 '16 at 08:20

1 Answer


crawler4j is by default designed to run on one machine. From the field of web crawling, we know that web crawler performance depends primarily on the following four resources:

  • Disk
  • CPU
  • Bandwidth
  • (RAM)

The optimal number of threads depends on your hardware setup, and a single machine can only drive so many of them; more machines will therefore result in higher throughput. The next hard limitation is network bandwidth. If you are not attached via a high-speed Internet connection, this will be the bottleneck of your approach.
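
As a rough starting point (an assumption on my part, not something crawler4j prescribes), an I/O-bound crawler can usually run several threads per core, since most of the time is spent waiting on the network; measure the throughput and adjust the multiplier:

// Sketch: derive a starting thread count from the hardware and scale it up
// for I/O wait; the multiplier is an assumption to be tuned empirically.
int cores = Runtime.getRuntime().availableProcessors();
int ioMultiplier = 8;                          // crawling is mostly network wait
int numberOfCrawlers = cores * ioMultiplier;   // e.g. 8 cores -> 64 threads
controller.startNonBlocking(() -> new MyCrawler(urlsQueue), numberOfCrawlers);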

Moreover, crawler4j is not designed by default to load such a huge seed file. This is because crawler4j respects crawler politeness: before the crawl starts, every seed URL is checked for a robots.txt, which can take quite a bit of time.
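
If your use case permits it, the per-seed robots.txt lookup can be switched off in the configuration; here is a sketch against the setup from the question, assuming robots.txt compliance is not a requirement for this crawl:

// Sketch (assumption: robots.txt compliance is not required here).
// Disabling the check removes one extra HTTP request per host before a seed can be fetched.
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
robotstxtConfig.setEnabled(false);
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);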

Adding seeds after the crawl has started is possible and should work if the crawl is started in non-blocking mode. However, it can take a while until the URLs are processed.

For a multi-machine setup you can take a look at Apache Nutch. However, Nutch is a bit difficult to learn.

EDIT:

After reproducing your setup, I am able to answer your question regarding adding seed pages dynamically.

Starting the crawler in this manner

controller.startNonBlocking(() -> {
    return new MyCrawler(urlsQueue);
}, 4);

will invoke the run() method of every crawler thread. Investigating this method, we find a method named frontier.getNextURLs(50, assignedURLs), which is responsible for taking unseen URLs from the frontier in order to process them. In this method we find a so-called waitingList, which causes the thread to wait. Since notifyAll is never invoked on waitingList until the controller is shut down, the threads never wake up to pick up new URLs.

To overcome this issue, you have two possible solutions:

  1. Just add at least one URL per thread as a seed point before starting; the deadlock situation will then not occur. After starting the threads in non-blocking mode, you can add further seeds as you like.

    controller.addSeed("https://www.google.de");
    
    controller.startNonBlocking(() -> {
        return new MyCrawler(urlsQueue);
    }, 4);
    
    controller.addSeed("https://www.google.de/test");
    
    controller.waitUntilFinish();
    
  2. Fork the GitHub project and adapt the code of Frontier.java so that waitingList.notifyAll() can be invoked from the CrawlController after seed pages are dynamically added; see the sketch below.
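
A hypothetical sketch of that fork (the names waitingList and frontier follow the crawler4j 4.2 source discussed above, but treat all of this as an assumption rather than a ready patch):

// In Frontier.java: expose a way to wake up crawler threads parked on the waitingList monitor.
public void wakeUpWaitingThreads() {
    synchronized (waitingList) {
        waitingList.notifyAll();
    }
}

// In CrawlController, after a dynamically added seed has been scheduled on the frontier:
frontier.wakeUpWaitingThreads();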

rzo1
  • Thanks for the answer. I tried adding just a few seeds after starting in non-blocking mode and it still didn't do anything (no exceptions either). So basically you're saying crawler4j is not a good match for this? I looked at Nutch, but it's very unfriendly if you're trying to do anything that is not part of the "out of the box" configuration; it's also weird because even simple fetching and parsing use cases require you to be very familiar with Nutch, which kind of prevents it from being a good candidate for a POC evaluation – Gideon Feb 24 '16 at 15:03
  • Could you provide the corresponding code for your non-blocking attempt, as well as the crawler4j version used, via an edit, so I can take a deeper look into it? I did modify crawler4j to be used with the Java Parallel Processing Framework and integrated the IMDG Hazelcast. This scales quite well in a multi-machine setup. Sadly it is not ready for a public release yet. – rzo1 Feb 24 '16 at 15:15
  • Done. Let me know if you need any more information. Thanks for the help – Gideon Feb 24 '16 at 15:25
  • A sample implementation of your MyCrawler class could also be useful. I will take a look at it in the evening. – rzo1 Feb 24 '16 at 15:48
  • Added the full implementation. Are you the author of crawler4j? – Gideon Feb 24 '16 at 16:12
  • Nope. But I spent a bit of time reading the crawler4j source in order to modify it for my needs ;-) – rzo1 Feb 24 '16 at 16:13
  • Cool, I appreciate your help – Gideon Feb 24 '16 at 16:16
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/104450/discussion-between-rzo-and-gideon). – rzo1 Feb 24 '16 at 17:08
  • Your first suggestion works, thanks a lot for that. Could you just add a few words on how to determine the number of threads per machine? Is it anything besides cores * threads_per_core? – Gideon Feb 25 '16 at 12:59
  • See: http://stackoverflow.com/questions/24149834/choosing-optimal-number-of-threads-for-parallel-processing-of-data – rzo1 Feb 25 '16 at 13:16
  • I'm familiar with choosing the number of threads per machine for CPU-intensive tasks; however, this seems to be more I/O-bound, which is somewhat different. Could you shed some light on that, maybe from your personal experience? – Gideon Feb 25 '16 at 13:28
  • We run the modified version of crawler4j on a cluster of 20 virtual machines with 8 virtual CPUs each. After observing the behaviour under different configurations, we chose to start 84 JPPF tasks with 16 threads per task, i.e. roughly 16 * 4 = 64 crawler threads per virtual machine, with a politeness delay of 800 ms. This setup performed quite well (and really fast) for our purpose. – rzo1 Feb 25 '16 at 13:36