I would like to use Apache Nutch in my Java application to crawl web pages from one or more websites. Basically, I need to call a method of my Java application for each web page found by the crawler, in order to process the page content (text, etc.). How can I achieve this?
-
please also see this related answer I posted elsewhere: http://stackoverflow.com/questions/10007178/how-do-i-save-the-origin-html-file-with-apache-nutch/19274588#19274588 – andrew.butkus Jun 01 '16 at 13:50
1 Answer
Well, your question appears to be an "XY problem". Nutch can be used as a library in your custom Java application: the bin/nutch and bin/crawl scripts basically just execute several Java classes with the right parameters, so your application could call the same classes with the same parameters directly. Taking a look at the bin/crawl script will show you the right sequence of steps (and classes) to call for a full crawl cycle. This approach should only be used for small crawls.
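As a rough sketch, one crawl cycle driven from your own code could look like the following (Nutch 1.x class names; the paths "crawl/crawldb", "urls" and "crawl/segments" are assumptions, and picking the newest segment directory is left as a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.util.NutchConfiguration;

public class CrawlCycle {
    public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();

        // 1. Inject the seed URLs (from the "urls" directory) into the CrawlDb
        ToolRunner.run(conf, new Injector(), new String[] {"crawl/crawldb", "urls"});

        // 2. Generate a fetch list; this creates a new segment under crawl/segments
        ToolRunner.run(conf, new Generator(), new String[] {"crawl/crawldb", "crawl/segments"});

        // 3. Locate the segment that was just generated (placeholder: list
        //    crawl/segments and take the newest timestamped directory)
        String segment = "crawl/segments/<latest>";

        // 4. Fetch the pages listed in the segment
        ToolRunner.run(conf, new Fetcher(), new String[] {segment});

        // 5. Parse the fetched content
        ToolRunner.run(conf, new ParseSegment(), new String[] {segment});

        // 6. Fold the parse results back into the CrawlDb (updatedb step)
        ToolRunner.run(conf, new CrawlDb(), new String[] {"crawl/crawldb", segment});
    }
}
```

Each of these classes implements Hadoop's Tool interface, which is why bin/nutch can dispatch to them with plain command-line arguments; your application can do the same thing in-process.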
Now, going back to the XY problem: if all you need is to extract custom text/metadata from the web pages, you could just extend Nutch itself without writing a custom application. From what you describe, it looks like you are after a custom parser/indexing plugin. If that is the case, I recommend taking a look at the headings plugin (https://github.com/apache/nutch/tree/master/src/plugin/headings), which is a very good starting point for writing your own HtmlParseFilter plugin. You'll still need to write custom code, but it will be contained in a Nutch plugin.
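A skeleton for such a plugin, modeled loosely on the headings plugin, might look like this (the class name MyParseFilter and the metadata key are made-up placeholders; the filter method signature is the Nutch 1.x HtmlParseFilter extension point):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

// Hypothetical parse filter: called by Nutch for every successfully parsed page.
public class MyParseFilter implements HtmlParseFilter {

    private Configuration conf;

    @Override
    public ParseResult filter(Content content, ParseResult parseResult,
                              HTMLMetaTags metaTags, DocumentFragment doc) {
        // Look up the Parse object for the URL being processed
        Parse parse = parseResult.get(content.getUrl());

        // The extracted plain text of the page -- this is where you would
        // call your own processing logic
        String text = parse.getText();

        // Example: attach a custom value to the page's parse metadata
        // ("my-key" is a placeholder)
        parse.getData().getParseMeta().add("my-key", "processed:" + text.length());

        return parseResult;
    }

    @Override
    public void setConf(Configuration conf) { this.conf = conf; }

    @Override
    public Configuration getConf() { return conf; }
}
```

The filter method receives the raw Content, the DOM of the page, and the ParseResult, so you can do per-page processing exactly at the point you described, without maintaining a separate crawler driver.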
You could also check out https://issues.apache.org/jira/browse/NUTCH-1870, a plugin that allows extracting custom portions of the HTML using XPath expressions.
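Whichever plugin route you take, remember that it has to be activated via the plugin.includes property in conf/nutch-site.xml before Nutch will load it. A sketch (the plugin id "myfilter" is a placeholder for your plugin's id):

```xml
<!-- conf/nutch-site.xml -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|myfilter</value>
</property>
```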
