I am trying to crawl some sites that have poorly maintained HTML structure, and I have no control over the sites to change it. When I look at the Nutch-crawled data indexed by Solr, the 'title' field looks okay, whereas the 'content' field contains a lot of junk: it grabbed all the text from the HTML banner with its drop-down menu and worked down through the left-side menu, navigation, footer, etc.
In my case, I only want to grab the "Description:" information, which is defined in a paragraph on the HTML page, into the 'content' field.
Example (raw HTML):

    <p><strong>Description:</strong> Apache Nutch is an open source Web crawler written in Java. By using it, we can find Web page hyperlinks in an automated manner, reduce lots of maintenance work, for example checking broken links, and create a copy of all the visited pages for searching over.</p>
How can I filter the junk out of the 'content' field and keep only the information I am interested in?
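To make the goal concrete, here is a standalone sketch of roughly the extraction I am after, written with jsoup rather than inside my Nutch setup; the sample HTML and the assumption that the description always sits in a <p><strong>Description:</strong> ...</p> block are only for illustration:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class DescriptionExtractor {
        public static void main(String[] args) {
            // Toy page: the banner and footer stand in for the junk that ends up in 'content'.
            String html = "<div id=\"banner\">Home | Products | About</div>"
                    + "<p><strong>Description:</strong> Apache Nutch is an open source"
                    + " Web crawler written in Java.</p>"
                    + "<div id=\"footer\">Copyright ...</div>";

            Document doc = Jsoup.parse(html);
            // Look only at paragraphs that contain a <strong> label.
            for (Element p : doc.select("p:has(strong)")) {
                if (p.select("strong").first().text().startsWith("Description")) {
                    // Keep the paragraph text, minus the "Description:" label itself.
                    String description = p.text().replaceFirst("^Description:\\s*", "");
                    System.out.println(description);
                }
            }
        }
    }

This prints only the description sentence and ignores the banner and footer text.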
– You can write a plugin to parse out just the content you need and add it to that field: http://wiki.apache.org/nutch/WritingPluginExample or http://stackoverflow.com/questions/15717239/apache-nutch-2-1-how-get-complete-source-code/15743466#15743466 – nimeshjm Jul 12 '13 at 16:27
– "Description:" is just one example; not all the sites I crawled would have a similar structure. – sunskin Jul 14 '13 at 19:31
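Following nimeshjm's suggestion, this is a rough sketch of the DOM-walking logic such a parse plugin could apply; the plugin wiring itself is covered by the WritingPluginExample link above, and the class name plus the assumption that the description always sits in a <p><strong>Description:</strong> ...</p> block are only for illustration (Nutch's HTML parser hands plugins an org.w3c.dom tree, so plain DOM traversal is used here):

    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class DescriptionDomWalker {

        // Depth-first search for a <p> whose <strong> label starts with "Description".
        // Element names are compared case-insensitively because the parser may
        // report them in upper case.
        public static String findDescription(Node node) {
            NodeList children = node.getChildNodes();
            if (node.getNodeType() == Node.ELEMENT_NODE
                    && "p".equalsIgnoreCase(node.getNodeName())) {
                for (int i = 0; i < children.getLength(); i++) {
                    Node child = children.item(i);
                    if (child.getNodeType() == Node.ELEMENT_NODE
                            && "strong".equalsIgnoreCase(child.getNodeName())
                            && child.getTextContent().trim().startsWith("Description")) {
                        // Return the paragraph text without the "Description:" label.
                        return node.getTextContent().replaceFirst("^\\s*Description:\\s*", "");
                    }
                }
            }
            // Not the paragraph we want: recurse into the children.
            for (int i = 0; i < children.getLength(); i++) {
                String result = findDescription(children.item(i));
                if (result != null) {
                    return result;
                }
            }
            return null;
        }
    }

The string this returns (or an empty string when no such paragraph exists) is what the plugin would put into the 'content' field instead of the full page text.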