
I'm trying to get Nutch running for the first time for a work project. For now, the plan is to run Nutch on a single machine (Windows 7) to scrape content from a dozen or so web sites. Below is the command-line output from Cygwin.

$ bin/nutch inject crawl/crawldb urls
Injector: starting at 2016-10-29 09:16:37
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: java.lang.NullPointerException
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
        at org.apache.hadoop.util.Shell.run(Shell.java:418)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
        at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
        at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:467)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:849)
        at org.apache.hadoop.fs.FileSystem.createNewFile(FileSystem.java:1149)
        at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:58)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:357)
        at org.apache.nutch.crawl.Injector.run(Injector.java:467)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.crawl.Injector.main(Injector.java:441)

Looking through the source, here are lines 440 thru 443 of org.apache.nutch.crawl.Injector:

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(NutchConfiguration.create(), new Injector(), args);
    System.exit(res);
  }

It's not clear whether it is NutchConfiguration.create() or new Injector() that is failing there. I set up my installation by following the tutorial on the Nutch site. I put a list of 3 URLs, one per line, in the file ./urls/seed.txt, and edited ./conf/nutch-site.xml.

Any suggestions for investigation/debugging this would be appreciated. Thank you!

Stu
  • After searching through all the source, it appears that org.apache.hadoop.fs.FileSystem is not included in the binary distribution. So my next step will be to download and install hadoop and include its libraries in the classpath for running nutch. I'll let you know how that goes. – Stu Oct 30 '16 at 13:47
  • No change to the command-line results after installing hadoop and adding the hadoop jars to the classpath. On to other ideas. Nutch's hadoop log shows a problem locating the executable null\bin\winutils.exe. I see there are other stackoverflow questions and comments about that, so I'll explore that route. – Stu Oct 30 '16 at 21:04
  • There is a very good Q&A for the problem with winutils.exe at: (http://stackoverflow.com/questions/35652665/java-io-ioexception-could-not-locate-executable-null-bin-winutils-exe-in-the-ha) – Stu Nov 02 '16 at 20:03
  • For convenience, here is the final solution to hadoop's inability to locate winutils.exe: (A) Download winutils.exe from http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe. (B) Set the HADOOP_HOME environment variable via Windows Control Panel > System > Advanced system settings > Environment Variables. For example, set HADOOP_HOME = C:\winutils if the file itself is at C:\winutils\bin\winutils.exe. With that resolved, there is now a java.lang.UnsatisfiedLinkError. I have seen posts on that and will continue down that route. – Stu Nov 02 '16 at 20:12
  • Did you end up resolving this? I just started doing what you stated and am hitting this same exact issue. – fujiiface Mar 15 '17 at 22:56
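
The log check described in the comments above can be sketched as a small shell function. This is only a minimal sketch: the function name `find_winutils_errors` is made up for illustration, and the default path `logs/hadoop.log` is an assumption about where your Nutch runtime writes its log, so pass your own path if it differs.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Nutch): print log lines that mention winutils.
# The default log path is an assumption; pass the real path as the first argument.
find_winutils_errors() {
  local log_file="${1:-logs/hadoop.log}"
  if [ ! -f "$log_file" ]; then
    echo "log file not found: $log_file" >&2
    return 2
  fi
  # -n prefixes each match with its line number, -i ignores case.
  # grep exits with status 1 when nothing matches.
  grep -n -i 'winutils' "$log_file"
}
```

For example, `find_winutils_errors logs/hadoop.log` run from the Nutch runtime directory would list any lines like the "null\bin\winutils.exe" message, if they are present.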

2 Answers


OK, after some struggling, here are the final steps to get hadoop working with Cygwin/Windows.

  1. Download the versions of winutils.exe and hadoop.dll that match your hadoop version from https://github.com/cdarlint/winutils and place them in a folder named bin.

  2. Set HADOOP_HOME to the directory that contains the bin folder above. (For example, if the two files are in D:\winutil\bin, then HADOOP_HOME = D:\winutil.)

  3. Make sure to add D:\winutil\bin to the Windows PATH variable. This step is important now (it was not a while back).
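
To double-check the result of the steps above from a Cygwin shell, something like the following could be used. It is a minimal sketch, not part of hadoop or Nutch: the function name `check_hadoop_home` is hypothetical, and it only checks that the two files named in step 1 exist under the bin folder.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify that a directory contains bin/winutils.exe and
# bin/hadoop.dll. Uses the first argument, or falls back to $HADOOP_HOME.
check_hadoop_home() {
  local home="${1:-${HADOOP_HOME:-}}"
  local status=0
  local f
  if [ -z "$home" ]; then
    echo "HADOOP_HOME is not set" >&2
    return 1
  fi
  for f in winutils.exe hadoop.dll; do
    if [ -f "$home/bin/$f" ]; then
      echo "found:   $home/bin/$f"
    else
      echo "missing: $home/bin/$f"
      status=1
    fi
  done
  # Non-zero exit status means at least one file was missing.
  return "$status"
}
```

Under Cygwin, a Windows-style value such as D:\winutil may need converting first, e.g. `check_hadoop_home "$(cygpath -u "$HADOOP_HOME")"`. For step 3, `command -v winutils.exe` shows whether the bin folder is actually on the PATH of the current shell.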

Sachin Mittal

I had the same issue. I solved it by setting up Hadoop on the machine and placing winutils.exe in %HADOOP%/bin.

After that you will get a java.lang.UnsatisfiedLinkError. To resolve it, open the nutch script in %NUTCH_HOME%/runtime/local/bin and comment out the lines below:

if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
  NUTCH_OPTS=("${NUTCH_OPTS[@]}" -Djava.library.path="$JAVA_LIBRARY_PATH")
fi
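
If you would rather not edit the script by hand, the same change can be sketched with sed. This is only an illustration under assumptions: the function name `comment_out_library_path` is made up, and the pattern assumes the block appears in your copy of the script exactly as quoted above (optionally indented). It writes the result to stdout and leaves the original file untouched.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a copy of the given script with the
# JAVA_LIBRARY_PATH block commented out. The input file is not modified.
comment_out_library_path() {
  local script="$1"
  # The sed range starts at the `if` line that tests JAVA_LIBRARY_PATH and
  # ends at the first following line that consists only of `fi`.
  # Every line in that range gets a leading '#'.
  sed '/^[[:space:]]*if \[ "x\$JAVA_LIBRARY_PATH" != "x" ]; then/,/^[[:space:]]*fi[[:space:]]*$/ s/^/#/' "$script"
}
```

For example, `comment_out_library_path bin/nutch > bin/nutch.patched` produces a patched copy that you can compare against the original before swapping it in.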
nr spider