Nutch Raw Html Saving

Question

I'm trying to save the raw HTML of crawled pages to separate files, each named after the page's URL. Is it possible with Nutch to save the raw HTML pages to separate files while skipping the indexing step?

You can look at this post: [How do I save the origin html file with Apache Nutch](https://stackoverflow.com/questions/10007178/how-do-i-save-the-origin-html-file-with-apache-nutch) – Haider Ali, Feb 07 '14 at 10:55

score 2 · Accepted Answer · edited May 23 '17 at 12:23


There is no direct way to do that. You will have to make a few code modifications. See this and this.
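As a rough illustration of the file-writing half of such a modification, here is a minimal sketch in plain Java. It assumes you already have the page URL and its raw bytes in hand from wherever your change hooks into the crawl; `RawHtmlSaver`, `fileNameFor`, and `save` are hypothetical names and not part of Nutch. URL-encoding is just one possible way to get a file name that is unique per URL and free of path separators.

```java
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Nutch: writes one file per URL.
class RawHtmlSaver {

    // Turn a URL into a single file name by percent-encoding it,
    // so that '/', ':' and '?' cannot be read as path syntax.
    static String fileNameFor(String url) {
        try {
            // URLEncoder leaves '*' untouched; encode it too, since it is
            // not allowed in file names on some file systems.
            return URLEncoder.encode(url, "UTF-8").replace("*", "%2A");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Write the raw bytes unchanged to <dir>/<encoded-url>.html
    // and return the path of the file that was written.
    static Path save(Path dir, String url, byte[] rawContent) throws IOException {
        Files.createDirectories(dir);
        Path target = dir.resolve(fileNameFor(url) + ".html");
        return Files.write(target, rawContent);
    }
}
```

Calling `RawHtmlSaver.save(outputDir, url, bytes)` once per fetched page would produce one file per URL. Very long URLs can exceed the file system's name length limit, so a hash of the URL may be a safer name in that case.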

answered Apr 14 '12 at 02:06

Tejas Patil


Thanks. For anyone looking for the answer, the one in the first link is useful. – İsmet Alkan Apr 14 '12 at 11:48
