Good day.

I have a blocker issue with my web crawler project. The logic is simple: first it creates one Runnable, which downloads an HTML document, scans it for all links, and then creates a new Runnable object for each found link. Each newly created Runnable in its turn creates new Runnable objects for each link and executes them.

The problem is that the ExecutorService never stops.

CrawlerTest.java

public class CrawlerTest {

    public static void main(String[] args) throws InterruptedException {
        new CrawlerService().crawlInternetResource("https://jsoup.org/");
    }
}

CrawlerService.java

import java.io.IOException;
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class CrawlerService {

    private Set<String> uniqueUrls = Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>(10000));
    private ExecutorService executorService = Executors.newFixedThreadPool(8);
    private String baseDomainUrl;

    public void crawlInternetResource(String baseDomainUrl) throws InterruptedException {
        this.baseDomainUrl = baseDomainUrl;
        System.out.println("Start");
        executorService.execute(new Crawler(baseDomainUrl)); // Run the first task to scan the main domain page. This task submits new tasks.
        executorService.awaitTermination(10, TimeUnit.MINUTES);
        System.out.println("End");
    }

    private class Crawler implements Runnable { // Inner class that encapsulates a crawl task and scans for links

        private String urlToCrawl;

        public Crawler(String urlToCrawl) {
            this.urlToCrawl = urlToCrawl;
        }

        public void run() {
            try {
                findAllLinks();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }

        private void findAllLinks() throws InterruptedException {
            /* Try to add the URL to the collection; if it is unique, add it,
             * scan the document, and submit a new task for each found link */
            if (uniqueUrls.add(urlToCrawl)) { 
                System.out.println(urlToCrawl);

                Document htmlDocument = loadHtmlDocument(urlToCrawl);
                Elements foundLinks = htmlDocument.select("a[href]");

                for (Element link : foundLinks) {
                    String absLink = link.attr("abs:href");
                    if (absLink.contains(baseDomainUrl) && !absLink.contains("#")) { // Check that we don't leave the base domain
                        executorService.execute(new Crawler(absLink)); // Submit a new task for each found link
                    }
                }
            }
        }

        private Document loadHtmlDocument(String internetResourceUrl) {
            Document document = null;
            try {
                document = Jsoup.connect(internetResourceUrl).ignoreHttpErrors(true).ignoreContentType(true)
                        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:48.0) Gecko/20100101 Firefox/48.0")
                        .timeout(10000).get();
            } catch (IOException e) {
                System.out.println("Page load error");
                e.printStackTrace();
            }
            return document;
        }
    }
}

This app needs about 20 seconds to scan jsoup.org for all unique links. But it just waits 10 minutes on executorService.awaitTermination(10, TimeUnit.MINUTES); and then I see a dead main thread and a still-working executor.

[Screenshot: worker threads still running after the main thread has died]

How do I force the ExecutorService to work correctly?

I think the problem is that it invokes executorService.execute inside another task instead of in the main thread.

Redeemer
  • Handle `executorService` in a try/catch and call `executorService.shutdown();` in the `finally` block. [Reference](https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorService.html) – Imran Aug 12 '16 at 10:23
  • @Imran Doesn't work. It still waits 10 minutes until the main thread dies. I think the problem is that it invokes executorService.execute inside another task instead of in the main thread. – Redeemer Aug 12 '16 at 10:31

4 Answers


You are misusing awaitTermination. According to the javadoc, you should call shutdown first:

Blocks until all tasks have completed execution after a shutdown request, or the timeout occurs, or the current thread is interrupted, whichever happens first.
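
In other words, the wait only makes sense once a shutdown request has actually been made:

    executorService.shutdown(); // request shutdown first: no new tasks are accepted
    executorService.awaitTermination(10, TimeUnit.MINUTES); // then wait for running tasks to finish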

To achieve your goal I'd suggest using a CountDownLatch (or a latch that supports increments, like this one) to determine the exact moment when there are no tasks left, so you can safely call shutdown. A minimal sketch of such a latch is below.
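
For illustration, an incrementable latch could be hand-rolled along these lines (a minimal sketch with a made-up class name, not the class linked above):

    // Sketch of a latch that can count up as well as down.
    public class IncrementableLatch {

        private int count;

        public IncrementableLatch(int initialCount) {
            this.count = initialCount;
        }

        public synchronized void increment() {
            count++;
        }

        public synchronized void countDown() {
            if (--count == 0) {
                notifyAll(); // release every thread blocked in await()
            }
        }

        public synchronized void await() throws InterruptedException {
            while (count > 0) {
                wait();
            }
        }
    }

The crawler would create it with count 1, call increment() before every execute, call countDown() in a finally block at the end of run(), and call shutdown() once await() returns.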

vsminkov
  • I can't use CountDownLatch because I don't know beforehand how many unique links I will collect from the resource. – Redeemer Aug 12 '16 at 10:34
  • If I call executorService.shutdown(); before executorService.awaitTermination(10, TimeUnit.MINUTES); it waits only for the first thread, and the crawler collects only the first link, https://jsoup.org/. I think the problem is that it invokes executorService.execute inside another task instead of in the main thread. – Redeemer Aug 12 '16 at 10:41
  • @Redeemer You can make a latch which supports counting up (like in the link in my answer). Then you start with count 1, increment it by one for each found (not yet processed) link, and decrement after your crawler finishes processing it. In `crawlInternetResource` you wait until the latch count is zero. – vsminkov Aug 12 '16 at 10:46
  • A latch won't help. If tasks can create new tasks, that's probably not going to work out well. @Redeemer You need to set a "maximum depth". It's the internet. Not sure about its momentary width, but it could be quite deep. – Fildor Aug 12 '16 at 11:02
  • @Fildor well... to limit depth you still can use latch :D – vsminkov Aug 12 '16 at 11:05
  • OK, If you **want** to use a latch ... :D – Fildor Aug 12 '16 at 11:35

I see your comment from earlier:

I can't use CountDownLatch because I don't know beforehand how many unique links I will collect from resource.

First off, vsminkov is spot on with the answer as to why awaitTermination will sit and wait for 10 minutes. I will offer an alternate solution.

Instead of using a CountDownLatch, use a Phaser. For each new task you can register, and then await completion.

Create a single Phaser, register each time a task is submitted via execute, and arrive each time a Runnable completes.

public void crawlInternetResource(String baseDomainUrl) {
    this.baseDomainUrl = baseDomainUrl;

    Phaser phaser = new Phaser();
    executorService.execute(new Crawler(phaser, baseDomainUrl)); 
    int phase = phaser.getPhase();
    phaser.awaitAdvance(phase); // block until all registered parties have arrived
}

private class Crawler implements Runnable { 

    private final Phaser phaser;
    private String urlToCrawl;

    public Crawler(Phaser phaser, String urlToCrawl) {
        this.urlToCrawl = urlToCrawl;
        this.phaser = phaser;
        phaser.register(); // register new task
    }

    public void run() {
        try {
            findAllLinks(); // same crawl logic as in the question
        } catch (InterruptedException e) {
            e.printStackTrace();
        } finally {
            phaser.arrive(); // arrive in finally so the phase advances even if the crawl fails
        }
    }
}
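
Putting phaser.arrive() in the finally block guarantees the phase advances even when a task fails, so the awaitAdvance in crawlInternetResource cannot hang on a crashed crawl. Once awaitAdvance returns, you would presumably also call executorService.shutdown().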
John Vint

You are not calling shutdown.

This may work: keep an AtomicLong counter in the CrawlerService and increment it before every new subtask is submitted to the executor service.

Modify your run() method to decrement this counter and, if it reaches 0, shut down the executor service:

public void run() {
    try {
        findAllLinks();
    } catch (InterruptedException e) {
        e.printStackTrace();
    } finally {
        // decrement the counter
        // if it reached 0, shut down the executor from here, or just notify CrawlerService, which would be doing wait()
    }
}

In the "finally", reduce the counter and when the counter is zero, shutdown executor or just notify CrawlerService. 0 means, this is the last one, no other is running, none pending in queue. No task will submit any new sub tasks.
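
A minimal sketch of that counter, folded into the question's CrawlerService (the pendingTasks field and the submit helper are illustrative names, not part of the original code):

    import java.util.concurrent.atomic.AtomicLong;

    // inside CrawlerService
    private final AtomicLong pendingTasks = new AtomicLong();

    private void submit(String url) {
        pendingTasks.incrementAndGet(); // count up *before* handing the task to the executor
        executorService.execute(new Crawler(url));
    }

    // inside Crawler
    public void run() {
        try {
            findAllLinks();
        } catch (InterruptedException e) {
            e.printStackTrace();
        } finally {
            if (pendingTasks.decrementAndGet() == 0) {
                executorService.shutdown(); // the last task finished and the queue is empty
            }
        }
    }

Incrementing before execute is what keeps the count from ever touching zero while work is still outstanding: a parent task registers all of its children before its own decrement runs.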

Rajesh Jose

How do I force the ExecutorService to work correctly?

I think the problem is that it invokes executorService.execute inside another task instead of in the main thread.

No. The problem is not with the ExecutorService. You are using the API in an incorrect manner and hence not getting the right result.

You have to use three APIs in a certain order to get the right result:

1. shutdown
2. awaitTermination
3. shutdownNow

The recommended way from the Oracle documentation page of ExecutorService:

 void shutdownAndAwaitTermination(ExecutorService pool) {
   pool.shutdown(); // Disable new tasks from being submitted
   try {
     // Wait a while for existing tasks to terminate
     if (!pool.awaitTermination(60, TimeUnit.SECONDS)) {
       pool.shutdownNow(); // Cancel currently executing tasks
       // Wait a while for tasks to respond to being cancelled
       if (!pool.awaitTermination(60, TimeUnit.SECONDS))
           System.err.println("Pool did not terminate");
     }
   } catch (InterruptedException ie) {
     // (Re-)Cancel if current thread also interrupted
     pool.shutdownNow();
     // Preserve interrupt status
     Thread.currentThread().interrupt();
   }
 }

shutdown(): Initiates an orderly shutdown in which previously submitted tasks are executed, but no new tasks will be accepted.

shutdownNow(): Attempts to stop all actively executing tasks, halts the processing of waiting tasks, and returns a list of the tasks that were awaiting execution.

awaitTermination(): Blocks until all tasks have completed execution after a shutdown request, or the timeout occurs, or the current thread is interrupted, whichever happens first.

On a different note: If you want to wait for all tasks to complete, refer to this related SE question:

wait until all threads finish their work in java

I prefer using invokeAll() or a ForkJoinPool, which are well suited to your use case; see the sketch below.
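
For example, a ForkJoinPool version of the crawler could be sketched like this (class and variable names are illustrative; the jsoup link extraction mirrors the question's code):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveAction;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class ForkJoinCrawler {

        private final Set<String> uniqueUrls =
                Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());
        private String baseDomainUrl;

        public void crawlInternetResource(String baseDomainUrl) {
            this.baseDomainUrl = baseDomainUrl;
            ForkJoinPool pool = new ForkJoinPool(8);
            pool.invoke(new CrawlTask(baseDomainUrl)); // blocks until the whole task tree is done
            pool.shutdown();
        }

        private class CrawlTask extends RecursiveAction {

            private final String url;

            CrawlTask(String url) {
                this.url = url;
            }

            @Override
            protected void compute() {
                if (!uniqueUrls.add(url)) {
                    return; // already visited
                }
                List<CrawlTask> subTasks = new ArrayList<>();
                for (String link : extractLinks(url)) {
                    subTasks.add(new CrawlTask(link));
                }
                invokeAll(subTasks); // fork all child tasks and wait for them to complete
            }

            private List<String> extractLinks(String pageUrl) {
                List<String> links = new ArrayList<>();
                try {
                    Document doc = Jsoup.connect(pageUrl)
                            .ignoreHttpErrors(true).ignoreContentType(true)
                            .timeout(10000).get();
                    for (Element link : doc.select("a[href]")) {
                        String absLink = link.attr("abs:href");
                        if (absLink.contains(baseDomainUrl) && !absLink.contains("#")) {
                            links.add(absLink);
                        }
                    }
                } catch (IOException e) {
                    System.out.println("Page load error: " + pageUrl);
                }
                return links;
            }
        }
    }

Because pool.invoke does not return until the root task and every task it spawned have completed, there is no need for any awaitTermination gymnastics at all.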

Ravindra babu