I want to get the content crawled by Nutch into a text file. I have used the readseg command, but the output is not useful.
Is there a plugin that can get Nutch to crawl and store the URL and content in a text file?
With Nutch 1.x, you can do something like:
./bin/nutch readseg -get out-crawl/segments/20160823085007/ "https://en.wikipedia.org/wiki/Canon" -nofetch -nogenerate -noparse -noparsedata -noparsetext > Canon.html
The output still comes with a few header lines to strip at the beginning of the file.
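If you want the whole segment in one plain-text file rather than a single page, the -dump mode of the same tool may be a better fit. A minimal sketch, reusing the segment path from the example above; the output directory name dumped-segment is arbitrary:
./bin/nutch readseg -dump out-crawl/segments/20160823085007/ dumped-segment -nofetch -nogenerate -noparse -noparsedata -noparsetext
This should write a text file (named dump) under dumped-segment/ listing the URL and raw content of every page in the segment; the -no* flags suppress everything except the content itself.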
You can modify the fetch job of Nutch 2.x to capture URLs and the page content belonging to them during the crawling process. In the source file (src/java/org/apache/nutch/fetcher/FetcherReducer.java):
case ProtocolStatusCodes.SUCCESS: // got a page
  String url = TableUtil.reverseUrl(fit.url); // the page's URL (in reversed key form)
  String text = Bytes.toString(ByteBuffer.wrap(content.getContent())); // the content belonging to that URL
  output(fit, content, status, CrawlStatus.STATUS_FETCHED);
  break;
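From there, writing the pair to a text file is straightforward. A minimal sketch, to be placed just before the output(...) call above; the file path and line format are made up for illustration, LOG stands for whatever logger the class already declares, and you would need java.io.BufferedWriter, java.io.FileWriter, and java.io.IOException imported at the top of FetcherReducer.java:
try (BufferedWriter writer = new BufferedWriter(
    new FileWriter("/tmp/crawled-pages.txt", true))) { // append mode
  // one page per line: URL, a tab, then the content with newlines flattened
  writer.write(url + "\t" + text.replace('\n', ' '));
  writer.newLine();
} catch (IOException e) {
  // don't fail the fetch just because the dump file could not be written
  LOG.error("Failed to write page dump for " + url, e);
}
Keep in mind that the reducer runs on the task nodes, so in a distributed crawl each task appends to its own local /tmp/crawled-pages.txt; for a single-node setup this is usually good enough.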
Hope this helps,
Le Quoc Do