
I want to get the content crawled by Nutch into a text file. I have used the readseg command, but the output was not useful.

Is there a plugin that can make Nutch crawl and store the URLs and their content in a text file?

Rahul

2 Answers


With Nutch 1.x, you can do something like:

./bin/nutch readseg -get out-crawl/segments/20160823085007/  "https://en.wikipedia.org/wiki/Canon" -nofetch -nogenerate -noparse -noparsedata -noparsetext > Canon.html

The output still contains a few lines at the beginning of the file that you will need to strip.
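
If you need every page in the segment rather than a single URL, readseg also has a -dump mode. A sketch, reusing the segment path from the example above (dump-out is an arbitrary output directory name, not something Nutch requires):

    ./bin/nutch readseg -dump out-crawl/segments/20160823085007/ dump-out -nofetch -nogenerate -noparse -noparsedata -noparsetext

This should leave a plain-text file named dump inside dump-out, containing the URL and raw content of each fetched record.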

Bob Yoplait

You can modify the fetch job of Nutch 2.x to capture each URL and the page content belonging to it during the crawling process. In the source file (src/java/org/apache/nutch/fetcher/FetcherReducer.java), extend the SUCCESS case:

      case ProtocolStatusCodes.SUCCESS:        // got a page
          String url = TableUtil.reverseUrl(fit.url);         // the URL of the fetched page
          String html = Bytes.toString(content.getContent()); // the content belonging to that URL
          // url and html can now be written out (see the sketch below)
          output(fit, content, status, CrawlStatus.STATUS_FETCHED);
          break;
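
For illustration, persisting the pair could be done with a small helper in the same class. This is a minimal sketch; the method name, output path, and record format are all assumptions, not part of Nutch, and on a multi-node cluster you would write to HDFS or a datastore instead of the local filesystem:

    // Hypothetical helper: appends one "URL <TAB> content" record to a local text file.
    private static synchronized void appendToTextFile(String url, String html) {
        try (java.io.Writer w = new java.io.FileWriter("/tmp/crawled-pages.txt", true)) {
            w.write(url);
            w.write('\t');
            w.write(html.replace('\n', ' ')); // keep each record on one line
            w.write('\n');
        } catch (java.io.IOException e) {
            e.printStackTrace(); // don't fail the fetch over a logging problem
        }
    }

Calling appendToTextFile(url, html) just before the output(...) line above would then record every successfully fetched page.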

Hope this helps,

Le Quoc Do
