I am trying to crawl some sites that have poorly maintained HTML structure, and I have no control over the sites to change it. When I look at the Nutch-crawled data indexed by Solr, the 'title' field looks okay, whereas the 'content' field contains a lot of junk: it grabbed all the text from the HTML banner with its drop-down menu and worked down through the left-side menu, navigation, footer, etc.
In my case, I only want to grab the "Description:" information, which is defined in a paragraph on the HTML page, into the 'content' field.
Example (raw HTML):

    <p><strong>Description:</strong> Apache Nutch is an open source Web crawler written in Java. By using it, we can find Web page hyperlinks in an automated manner, reduce lots of maintenance work, for example checking broken links, and create a copy of all the visited pages for searching over.</p>
How can I filter the junk out of the 'content' field and keep only the information I am interested in?
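To make the goal concrete, here is a standalone sketch of roughly the extraction I am after, written with jsoup rather than inside my Nutch setup; the sample HTML and the assumption that the description always sits in a <p><strong>Description:</strong> ...</p> block are only for illustration:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class DescriptionExtractor {
        public static void main(String[] args) {
            // Toy page: the banner and footer stand in for the junk that ends up in 'content'.
            String html = "<div id=\"banner\">Home | Products | About</div>"
                    + "<p><strong>Description:</strong> Apache Nutch is an open source"
                    + " Web crawler written in Java.</p>"
                    + "<div id=\"footer\">Copyright ...</div>";

            Document doc = Jsoup.parse(html);
            // Look only at paragraphs that contain a <strong> label.
            for (Element p : doc.select("p:has(strong)")) {
                if (p.select("strong").first().text().startsWith("Description")) {
                    // Keep the paragraph text, minus the "Description:" label itself.
                    String description = p.text().replaceFirst("^Description:\\s*", "");
                    System.out.println(description);
                }
            }
        }
    }

This prints only the description sentence and ignores the banner and footer text.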
– You can write a plugin to parse out just the content you need and add it to that field: http://wiki.apache.org/nutch/WritingPluginExample or http://stackoverflow.com/questions/15717239/apache-nutch-2-1-how-get-complete-source-code/15743466#15743466 – nimeshjm Jul 12 '13 at 16:27
– "Description:" is just one example; not all the sites I crawled would have a similar structure. – sunskin Jul 14 '13 at 19:31
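Following nimeshjm's suggestion, this is a rough sketch of the DOM-walking logic such a parse plugin could apply; the plugin wiring itself is covered by the WritingPluginExample link above, and the class name plus the assumption that the description always sits in a <p><strong>Description:</strong> ...</p> block are only for illustration (Nutch's HTML parser hands plugins an org.w3c.dom tree, so plain DOM traversal is used here):

    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class DescriptionDomWalker {

        // Depth-first search for a <p> whose <strong> label starts with "Description".
        // Element names are compared case-insensitively because the parser may
        // report them in upper case.
        public static String findDescription(Node node) {
            NodeList children = node.getChildNodes();
            if (node.getNodeType() == Node.ELEMENT_NODE
                    && "p".equalsIgnoreCase(node.getNodeName())) {
                for (int i = 0; i < children.getLength(); i++) {
                    Node child = children.item(i);
                    if (child.getNodeType() == Node.ELEMENT_NODE
                            && "strong".equalsIgnoreCase(child.getNodeName())
                            && child.getTextContent().trim().startsWith("Description")) {
                        // Return the paragraph text without the "Description:" label.
                        return node.getTextContent().replaceFirst("^\\s*Description:\\s*", "");
                    }
                }
            }
            // Not the paragraph we want: recurse into the children.
            for (int i = 0; i < children.getLength(); i++) {
                String result = findDescription(children.item(i));
                if (result != null) {
                    return result;
                }
            }
            return null;
        }
    }

The string this returns (or an empty string when no such paragraph exists) is what the plugin would put into the 'content' field instead of the full page text.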