I need to write a web scraper that scrapes ~1M websites and saves their title, description and keywords into one big file (containing the scraped URL and the related words). The URLs are extracted from a big input file.
I ran Crawler4j on the file of 1M URLs and started the crawler using controller.start(MyCrawler.class, 20), where 20 is an arbitrary number. Each crawler passes the extracted words into a blocking queue, and a single writer thread writes these words, together with the URL, to the file. I use one writer thread so I don't have to synchronize access to the file. I set the crawl depth to 0 (I only need to crawl my seed list).
After running this overnight I had downloaded only around 200K of the URLs. I'm running the scraper on one machine with a wired connection. Since most of the URLs are on different hosts, I don't think the politeness delay matters much here.
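Roughly, that blocking setup looked like the following. This is a minimal sketch, not the exact code: the storage folder, the URLs file name and the wrapper class are placeholders, and the comments just describe the behaviour of Crawler4j's blocking start.

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

import java.io.BufferedReader;
import java.io.FileReader;

public class BlockingCrawlRunner {

    public static void main(String[] args) throws Exception {
        // Placeholder values, not from the actual run.
        String crawlStorageFolder = "/tmp/crawl";
        String urlsFileName = "urls.txt";

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);
        config.setMaxDepthOfCrawling(0); // only crawl the seed list, no link following

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        // All seeds are added up front; start() then blocks until the crawl finishes.
        try (BufferedReader br = new BufferedReader(new FileReader(urlsFileName))) {
            String url;
            while ((url = br.readLine()) != null) {
                controller.addSeed(url);
            }
        }

        // 20 crawler threads, as in the original run. This overload creates the
        // crawler instances by reflection, so it needs a public no-arg constructor.
        controller.start(MyCrawler.class, 20);
    }
}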
EDIT
I tried starting Crawler4j using the non-blocking start, but it just blocked. My Crawler4j version is 4.2. This is the code I'm using:
CrawlConfig config = new CrawlConfig();
List<Header> headers = Arrays.asList(
        new BasicHeader("Accept", "text/html,text/xml"),
        new BasicHeader("Accept-Language", "en-gb, en-us, en-uk")
);
config.setDefaultHeaders(headers);
config.setCrawlStorageFolder(crawlStorageFolder);
config.setMaxDepthOfCrawling(0);
config.setUserAgentString("testcrawl");
config.setIncludeBinaryContentInCrawling(false);
config.setPolitenessDelay(10);

PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

BlockingQueue<String> urlsQueue = new ArrayBlockingQueue<>(400);
controller = new CrawlController(config, pageFetcher, robotstxtServer);

ExecutorService executorService = Executors.newSingleThreadExecutor();
Runnable writerThread = new FileWriterThread(urlsQueue, crawlStorageFolder, outputFile);
executorService.execute(writerThread);

controller.startNonBlocking(() -> new MyCrawler(urlsQueue), 4);

File file = new File(urlsFileName);
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
    String url;
    while ((url = br.readLine()) != null) {
        controller.addSeed(url);
    }
}
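FileWriterThread itself isn't shown above. A minimal sketch of what such a single-writer drain loop can look like is below: the class name and constructor signature match the call above (assuming crawlStorageFolder and outputFile are Strings), but the body is illustrative.

import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.BlockingQueue;

// Illustrative only: drains the queue and appends each record to the output
// file, so the file is only ever touched by this one thread.
public class FileWriterThread implements Runnable {

    private final BlockingQueue<String> urlsQueue;
    private final Path outputPath;

    public FileWriterThread(BlockingQueue<String> urlsQueue, String crawlStorageFolder, String outputFile) {
        this.urlsQueue = urlsQueue;
        this.outputPath = Paths.get(crawlStorageFolder, outputFile);
    }

    @Override
    public void run() {
        try (BufferedWriter writer = Files.newBufferedWriter(
                outputPath, StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            while (!Thread.currentThread().isInterrupted()) {
                String record = urlsQueue.take(); // blocks until a crawler pushes a record
                writer.write(record);             // each record already ends with "\n"
                writer.flush();                   // make progress visible while the crawl runs
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // stop cleanly when interrupted
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

Because only this thread writes, no synchronization on the file is needed, which is the point of pushing everything through the queue.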
EDIT 1 - This is the code for MyCrawler
public class MyCrawler extends WebCrawler {

    // Skip obvious non-HTML resources (the original pattern listed mp3 twice).
    private static final Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg|png|mp3|zip|gz))$");
    public static final String DELIMETER = "||||";

    // One buffer per crawler instance; Crawler4j creates one instance per crawler thread.
    private final StringBuilder buffer = new StringBuilder();
    private final BlockingQueue<String> urlsQueue;

    public MyCrawler(BlockingQueue<String> urlsQueue) {
        this.urlsQueue = urlsQueue;
    }

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches();
    }

    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData parseData = (HtmlParseData) page.getParseData();
            String html = parseData.getHtml();
            String title = parseData.getTitle();
            Document document = Jsoup.parse(html);

            // One output line per page: URL, delimiter, title, then meta description and keywords.
            buffer.append(url.replaceAll("[\n\r]", "")).append(DELIMETER).append(title);

            Elements descriptions = document.select("meta[name=description]");
            for (Element description : descriptions) {
                if (description.hasAttr("content")) {
                    buffer.append(description.attr("content").replaceAll("[\n\r]", ""));
                }
            }

            Elements elements = document.select("meta[name=keywords]");
            for (Element element : elements) {
                String keywords = element.attr("content").replaceAll("[\n\r]", "");
                buffer.append(keywords);
            }
            buffer.append("\n");

            String urlContent = buffer.toString();
            buffer.setLength(0);

            // put() blocks when the bounded queue is full; add() would throw
            // IllegalStateException and abort the visit instead.
            try {
                urlsQueue.put(urlContent);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    private boolean isSuccessful(int statusCode) {
        return 200 <= statusCode && statusCode < 400;
    }
}
So I have 2 questions:
- Can someone suggest a way to make this process take less time? Maybe by tuning the number of crawler threads, or some other optimization? I'd prefer a solution that doesn't require several machines, but if you think that's the only way to go, could you suggest how to do that, maybe with a code example?
- Is there any way to make the crawler start working on some URLs and keep adding more URLs during the crawl? I've looked at crawler.startNonBlocking, but it doesn't seem to work very well.
Thanks in advance