I have managed to get Apache Nutch to index a news website and pass the results to Apache Solr.
I followed this tutorial: https://github.com/renepickhardt/metalcon/wiki/simpleNutchSolrSetup. The only difference is that I decided to use Cassandra instead of the storage backend used in the tutorial.
As a test, I am trying to crawl CNN and extract each article's title and the date it was published.
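To make the goal concrete, here is a rough standalone sketch of the output I am after. It is plain Python using only the standard library, not Nutch code, and it is not meant as the way a Nutch plugin would do it. The meta tag names in `DATE_META_KEYS` and the sample HTML are my own assumptions for illustration; I have not verified them against CNN's actual markup.

```python
from html.parser import HTMLParser

# Assumed meta keys that might carry the publication date (hypothetical,
# not verified against CNN's markup).
DATE_META_KEYS = {"article:published_time", "pubdate", "date"}


class ArticleMetaParser(HTMLParser):
    """Collects the <title> text and the first matching date <meta> tag."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.published = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and self.published is None:
            attr = dict(attrs)
            # The key may live in either the "property" or the "name" attribute.
            key = (attr.get("property") or attr.get("name") or "").lower()
            if key in DATE_META_KEYS:
                self.published = attr.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title_and_date(html):
    """Return (title, published) where published is None if no date tag matched."""
    parser = ArticleMetaParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.published


# Made-up sample page, only to show the expected shape of the input.
SAMPLE_HTML = """<html><head>
<title>Example headline</title>
<meta property="article:published_time" content="2014-03-01T12:00:00Z">
</head><body><p>Body text.</p></body></html>"""

print(extract_title_and_date(SAMPLE_HTML))
```

What I want is the equivalent of this pair (title, publication date) stored as separate fields for each crawled page, so that Solr can index and sort on them.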
Question 1:
How can I parse the data from a web page to extract the date and the title?
I have found this article about writing a plugin. It seems a bit outdated, and I am not sure that it still applies. I have also read that Tika can be used as well, but again most of the tutorials are quite old.
http://www.ryanpfister.com/2009/04/how-to-sort-by-date-with-nutch/
Another related SO question is this one:
"How to extend Nutch for article crawling". I would prefer to use Nutch, but only because that is what I have started with. I do not really have a strong preference.
Anything would be a great help.