Good day.
I have a blocker issue with my web crawler project.
The logic is simple: the first Runnable downloads an HTML document, scans it for links, and then creates a new Runnable for each found link. Each newly created Runnable in turn creates new Runnable objects for each link it finds and executes them.
The problem is that the ExecutorService never stops.
CrawlerTest.java
public class CrawlerTest {

    public static void main(String[] args) throws InterruptedException {
        new CrawlerService().crawlInternetResource("https://jsoup.org/");
    }
}
CrawlerService.java
import java.io.IOException;
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class CrawlerService {

    private Set<String> uniqueUrls = Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>(10000));
    private ExecutorService executorService = Executors.newFixedThreadPool(8);
    private String baseDomainUrl;

    public void crawlInternetResource(String baseDomainUrl) throws InterruptedException {
        this.baseDomainUrl = baseDomainUrl;
        System.out.println("Start");
        executorService.execute(new Crawler(baseDomainUrl)); // Submit the first task; it scans the main domain page and submits new tasks.
        executorService.awaitTermination(10, TimeUnit.MINUTES);
        System.out.println("End");
    }

    private class Crawler implements Runnable { // Inner class that encapsulates one crawl task: load a page and scan it for links

        private String urlToCrawl;

        public Crawler(String urlToCrawl) {
            this.urlToCrawl = urlToCrawl;
        }

        public void run() {
            try {
                findAllLinks();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }

        private void findAllLinks() throws InterruptedException {
            /* Try to add the new URL to the collection; if the URL is unique,
             * scan the document and submit a new task for each found link. */
            if (uniqueUrls.add(urlToCrawl)) {
                System.out.println(urlToCrawl);
                Document htmlDocument = loadHtmlDocument(urlToCrawl);
                if (htmlDocument == null) { // Page failed to load; nothing to scan.
                    return;
                }
                Elements foundLinks = htmlDocument.select("a[href]");
                for (Element link : foundLinks) {
                    String absLink = link.attr("abs:href");
                    if (absLink.contains(baseDomainUrl) && !absLink.contains("#")) { // Make sure we stay within the domain
                        executorService.execute(new Crawler(absLink)); // Submit a new task for each found link
                    }
                }
            }
        }

        private Document loadHtmlDocument(String internetResourceUrl) {
            Document document = null;
            try {
                document = Jsoup.connect(internetResourceUrl).ignoreHttpErrors(true).ignoreContentType(true)
                        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:48.0) Gecko/20100101 Firefox/48.0")
                        .timeout(10000).get();
            } catch (IOException e) {
                System.out.println("Page load error");
                e.printStackTrace();
            }
            return document;
        }
    }
}
This app needs about 20 seconds to scan jsoup.org for all unique links, but it just waits out the full 10 minutes of executorService.awaitTermination(10, TimeUnit.MINUTES); after that I see a dead main thread and a still-running executor.
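As far as I understand the Javadoc, awaitTermination only returns before the timeout if shutdown() has already been called and all tasks have finished; without a shutdown request it simply blocks for the whole period. This minimal test (my own sketch, not from the project above) seems to confirm that:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AwaitTerminationDemo {

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        pool.execute(() -> System.out.println("task finished"));
        // No shutdown() call here, so this blocks for the full 5 seconds
        // even though the only task completes almost immediately.
        boolean terminated = pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("terminated = " + terminated); // prints "terminated = false"
        pool.shutdown(); // allow the JVM to exit
    }
}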
How do I get the ExecutorService to terminate correctly? I suspect the problem is that executorService.execute is invoked from inside other tasks rather than from the main thread.
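One workaround I am considering is to count in-flight tasks and shut the executor down when the last one finishes, so that awaitTermination can return as soon as the crawl is done. This is only a sketch; the names SelfStoppingCrawlerSketch, submit, and pendingTasks are mine, not part of the project above:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class SelfStoppingCrawlerSketch {

    private final ExecutorService executorService = Executors.newFixedThreadPool(8);
    // Number of tasks submitted but not yet finished.
    private final AtomicInteger pendingTasks = new AtomicInteger();

    // All task submissions would go through this method instead of
    // calling executorService.execute() directly.
    private void submit(Runnable crawlTask) {
        pendingTasks.incrementAndGet(); // count the task before it can run
        executorService.execute(() -> {
            try {
                crawlTask.run(); // the task may call submit() again for new links
            } finally {
                // Child tasks are counted before their parent decrements, so
                // the counter can only reach zero when no work is left anywhere.
                if (pendingTasks.decrementAndGet() == 0) {
                    executorService.shutdown(); // lets awaitTermination() return
                }
            }
        });
    }
}

Since every child task is counted inside submit() before its parent task finishes, the counter can only drop to zero when the whole crawl is done. Is this a reasonable approach, or is there a standard idiom for this?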