5

I'm new to search engines and web crawlers. I want to store all the original pages of a particular website as HTML files, but with Apache Nutch I can only get binary database files. How do I get the original HTML files with Nutch?

Does Nutch support this? If not, what other tools can I use to achieve my goal? (Tools that support distributed crawling are preferred.)

İsmet Alkan
Freedom

5 Answers

9

Nutch writes the crawled data in binary form, so if you want it saved as HTML you will have to modify the code. (This will be painful if you are new to Nutch.)

If you want a quick and easy solution for getting the HTML pages:

  1. If the list of pages/URLs you intend to fetch is fairly small, it is better to do it with a script that invokes wget for each URL.
  2. Or use the HTTrack tool.
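
For a small URL list, the first option can also be sketched in plain Java instead of a wget script. This is only a stand-in under assumptions: the class name `PageSaver`, the method `saveAll`, and the `page-N.html` naming are all hypothetical, and it uses the JDK's `URL.openStream()` rather than wget itself.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: a plain-Java stand-in for "run wget on each URL".
final class PageSaver {

    // Downloads every URL into outputDir as page-0.html, page-1.html, ...
    // Returns the number of pages saved; failed downloads are reported and skipped.
    static int saveAll(List<String> urls, Path outputDir) throws IOException {
        Files.createDirectories(outputDir);
        int saved = 0;
        for (int i = 0; i < urls.size(); i++) {
            Path target = outputDir.resolve("page-" + i + ".html");
            try (InputStream in = URI.create(urls.get(i)).toURL().openStream()) {
                // Copy the raw response bytes to disk without decoding them.
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
                saved++;
            } catch (IOException | IllegalArgumentException e) {
                System.err.println("Skipping " + urls.get(i) + ": " + e.getMessage());
            }
        }
        return saved;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java PageSaver <outputDir> <url> [<url> ...]");
            return;
        }
        List<String> urls = Arrays.asList(args).subList(1, args.length);
        int saved = saveAll(urls, Paths.get(args[0]));
        System.out.println("Saved " + saved + " of " + urls.size() + " pages.");
    }
}
```

Unlike a real crawler, this sketch does not follow links, honor robots.txt, or throttle requests, so it only suits a short, known list of pages.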

EDIT:

Writing your own Nutch plugin would be great. Your problem gets solved, and you can contribute to Nutch by submitting your work! If you are new to Nutch (in terms of code and design), you will have to invest a lot of time building a new plugin; otherwise it is easy to do.

A few pointers to help you get started:

Here is a page which talks about writing your own Nutch plugin.

Start with Fetcher.java. See lines 647-648. That is the place where you can get the fetched content on a per-URL basis (for those pages which were fetched successfully).

    pstatus = output(fit.url, fit.datum, content, status, CrawlDatum.STATUS_FETCH_SUCCESS);
    updateStatus(content.getContent().length);

Add code right after this to invoke your plugin, and pass the content object to it. As you may have guessed, content.getContent() returns the content of the URL you want. Inside the plugin code, write it to a file. The filename should be based on the URL, otherwise the files will be difficult to work with. The URL is available as fit.url.
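
One way to derive a filename from the URL is sketched below. The class `UrlFileNames`, the method `toFileName`, and the 200-character limit are hypothetical choices for illustration, not part of Nutch.

```java
// Hypothetical helper, not part of Nutch: turns a URL into a flat, file-system-safe name.
final class UrlFileNames {

    // Assumed limit; many file systems cap a single name at 255 bytes.
    private static final int MAX_LENGTH = 200;

    static String toFileName(String url) {
        // Drop the scheme ("http://", "https://", ...) so names start with the host.
        String name = url.replaceFirst("^[A-Za-z][A-Za-z0-9+.-]*://", "");
        // Replace every character outside a conservative safe set with an underscore.
        name = name.replaceAll("[^A-Za-z0-9._-]", "_");
        if (name.isEmpty()) {
            name = "index";
        }
        if (name.length() > MAX_LENGTH) {
            name = name.substring(0, MAX_LENGTH);
        }
        return name + ".html";
    }

    public static void main(String[] args) {
        System.out.println(toFileName("http://example.com/a/b?x=1"));
    }
}
```

Distinct URLs can map to the same name under this scheme (for example `a/b` and `a_b`), so a real plugin would probably append a hash of the full URL or mirror the URL path as directories.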

Tejas Patil
  • Thank you, TejasP. I just heard that Nutch has a plugin mechanism to extend its functionality. I'm wondering if I can write some plugin to make it possible? – Freedom Apr 09 '12 at 09:10
  • Thanks for the details; they are very helpful to me. It's a guide for me to plunge into Nutch. I really appreciate it! – Freedom Apr 10 '12 at 02:25
6

You need to modify the Nutch source, so first get Nutch running in Eclipse.

Once it runs, open Fetcher.java and add the lines between the "content saver" comment lines.

case ProtocolStatus.SUCCESS:        // got a page
            pstatus = output(fit.url, fit.datum, content, status, CrawlDatum.STATUS_FETCH_SUCCESS, fit.outlinkDepth);
            updateStatus(content.getContent().length);


            //------------------------------------------- content saver ---------------------------------------------\\
            // Needs java.io.File, java.io.FileWriter and java.io.BufferedWriter imported in Fetcher.java.
            String filename = "savedsites/" + content.getUrl().replace('/', '-');

            File file = new File(filename);
            file.getParentFile().mkdirs();
            if (!file.createNewFile()) {
                System.out.println("File exists.");
            } else {
                // content.toString() puts Nutch metadata before the markup, so skip to the doctype when present.
                String text = content.toString();
                int start = text.indexOf("<!DOCTYPE html");
                BufferedWriter out = new BufferedWriter(new FileWriter(file));
                try {
                    // Fall back to the full text if the page has no "<!DOCTYPE html" marker.
                    out.write(start >= 0 ? text.substring(start) : text);
                } finally {
                    out.close();
                }
                System.out.println("File created successfully.");
            }
            //------------------------------------------- content saver ---------------------------------------------\\
İsmet Alkan
  • Will using this method skip creating the binary file too? – Alireza Noori Dec 26 '12 at 00:06
  • No, this just saves the raw HTML file before the binary file is created. You would have to remove the binary file creation lines yourself if you want that. However, I think that would be hard work, as Nutch is a very big and complex project. – İsmet Alkan Dec 26 '12 at 00:16
6

To update this answer:

It is possible to post-process the data in your crawl's segment folder and read the HTML (along with the other data Nutch has stored) directly.

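    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.protocol.Content;
    import org.apache.nutch.util.NutchConfiguration;

    // "segment" is the path of one segment directory from your crawl.
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);

    try
    {
            Text key = new Text();
            Content content = new Content();

            // Each record maps a URL (key) to its fetched Content.
            while (reader.next(key, content))
            {
                    System.out.println(new String(content.getContent()));
            }
    }
    catch (Exception e)
    {
            // Report the failure instead of silently swallowing it.
            e.printStackTrace();
    }
    finally
    {
            reader.close();
    }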
andrew.butkus
0

The answers here are obsolete. It is now possible to get the plain HTML files simply with nutch dump. Please see this answer.

CC.
-1

In Apache Nutch 2.3.1:
You can save the raw HTML by editing the Nutch code. First, get Nutch running in Eclipse by following https://wiki.apache.org/nutch/RunNutchInEclipse

Once Nutch runs in Eclipse, edit FetcherReducer.java and add this code to the output method, then run ant eclipse again to rebuild the class.

The raw HTML will then be stored in the reprUrl column in your database.

// Needs java.io.ByteArrayInputStream, java.nio.ByteBuffer and java.util.Scanner imported.
if (content != null) {
    ByteBuffer raw = fit.page.getContent();
    if (raw != null) {
        ByteArrayInputStream arrayInputStream = new ByteArrayInputStream(raw.array(), raw.arrayOffset() + raw.position(), raw.remaining());
        Scanner scanner = new Scanner(arrayInputStream);
        scanner.useDelimiter("\\Z"); // read the whole stream as a single token
        String data = "";
        if (scanner.hasNext()) {
            data = scanner.next();
        }
        // Store the raw HTML in the page's reprUrl field.
        fit.page.setReprUrl(StringUtil.cleanField(data));
        scanner.close();
    }
}