I would like to use Apache Nutch in my Java application to crawl web pages from one or more websites. Basically, I need to call a method of my Java application for each web page found by the crawler, in order to process the page content (text, etc.). How can I achieve this?
-
please also see this related answer I posted elsewhere: http://stackoverflow.com/questions/10007178/how-do-i-save-the-origin-html-file-with-apache-nutch/19274588#19274588 – andrew.butkus Jun 01 '16 at 13:50
1 Answer
Well, your question appears to be an "XY problem". Nutch can be used as a library in your custom Java application: the bin/nutch and bin/crawl scripts basically just execute several Java classes with the right parameters, so your application could call the same classes with the same parameters directly. Taking a look at the bin/crawl script will show you the right sequence of steps (and classes) to call for a full crawl cycle. This approach should only be used for small crawls.
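As a rough sketch, one crawl cycle driven from your own code could look like the following (Nutch 1.x class names; the paths "crawl/crawldb", "urls" and "crawl/segments" are assumptions, and picking the newest segment directory is left as a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.util.NutchConfiguration;

public class CrawlCycle {
    public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();

        // 1. Inject the seed URLs (from the "urls" directory) into the CrawlDb
        ToolRunner.run(conf, new Injector(), new String[] {"crawl/crawldb", "urls"});

        // 2. Generate a fetch list; this creates a new segment under crawl/segments
        ToolRunner.run(conf, new Generator(), new String[] {"crawl/crawldb", "crawl/segments"});

        // 3. Locate the segment that was just generated (placeholder: list
        //    crawl/segments and take the newest timestamped directory)
        String segment = "crawl/segments/<latest>";

        // 4. Fetch the pages listed in the segment
        ToolRunner.run(conf, new Fetcher(), new String[] {segment});

        // 5. Parse the fetched content
        ToolRunner.run(conf, new ParseSegment(), new String[] {segment});

        // 6. Fold the parse results back into the CrawlDb (updatedb step)
        ToolRunner.run(conf, new CrawlDb(), new String[] {"crawl/crawldb", segment});
    }
}
```

Each of these classes implements Hadoop's Tool interface, which is why bin/nutch can dispatch to them with plain command-line arguments; your application can do the same thing in-process.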
Now, going back to the XY problem: if all you need is to extract custom text/metadata from the web pages, you could just extend Nutch itself without writing a custom application. From what you describe, it looks like you are after a custom parser/indexing plugin. If that is the case, I recommend taking a look at the headings plugin (https://github.com/apache/nutch/tree/master/src/plugin/headings), which is a very good starting point for writing your own HtmlParseFilter plugin. You'll still need to write custom code, but it will be contained in a Nutch plugin.
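A skeleton for such a plugin, modeled loosely on the headings plugin, might look like this (the class name MyParseFilter and the metadata key are made-up placeholders; the filter method signature is the Nutch 1.x HtmlParseFilter extension point):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

// Hypothetical parse filter: called by Nutch for every successfully parsed page.
public class MyParseFilter implements HtmlParseFilter {

    private Configuration conf;

    @Override
    public ParseResult filter(Content content, ParseResult parseResult,
                              HTMLMetaTags metaTags, DocumentFragment doc) {
        // Look up the Parse object for the URL being processed
        Parse parse = parseResult.get(content.getUrl());

        // The extracted plain text of the page -- this is where you would
        // call your own processing logic
        String text = parse.getText();

        // Example: attach a custom value to the page's parse metadata
        // ("my-key" is a placeholder)
        parse.getData().getParseMeta().add("my-key", "processed:" + text.length());

        return parseResult;
    }

    @Override
    public void setConf(Configuration conf) { this.conf = conf; }

    @Override
    public Configuration getConf() { return conf; }
}
```

The filter method receives the raw Content, the DOM of the page, and the ParseResult, so you can do per-page processing exactly at the point you described, without maintaining a separate crawler driver.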
You could also check out https://issues.apache.org/jira/browse/NUTCH-1870, a plugin that allows extracting custom portions of the HTML using XPath expressions.
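Whichever plugin route you take, remember that it has to be activated via the plugin.includes property in conf/nutch-site.xml before Nutch will load it. A sketch (the plugin id "myfilter" is a placeholder for your plugin's id):

```xml
<!-- conf/nutch-site.xml -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|myfilter</value>
</property>
```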
