
I'm trying to integrate Apache Solr with Apache Nutch 1.14 on Windows 7 (64-bit), but I'm getting an error when I try to run Nutch.

Things I already did:

  • Setting the JAVA_HOME environment variable to C:\Program Files\Java\jdk1.8.0_25 (I also tried the 8.3 short form C:\Progra~1\Java\jdk1.8.0_25).
  • Downloading the Hadoop WinUtils files from https://github.com/steveloughran/winutils/tree/master/hadoop-3.0.0/bin, putting them in c:\winutils\bin, setting the HADOOP_HOME environment variable to c:\winutils, and adding the c:\winutils\bin folder to PATH.

(I also tried the Hadoop WinUtils 2.7.1 build, with no success; a quick sanity check of this setup is sketched below.)
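A hypothetical Cygwin session (the shell bin/crawl is run from) to verify that setup; the expected values are simply the paths listed above:

echo "$JAVA_HOME"      # expect C:\Program Files\Java\jdk1.8.0_25 (or the 8.3 short form)
echo "$HADOOP_HOME"    # expect C:\winutils
which winutils.exe     # expect /cygdrive/c/winutils/bin/winutils.exe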

The error I'm getting:

$ bin/crawl -i -D http://localhost:8983/solr/ -s urls/ TestCrawl 2
  Injecting seed URLs
  /home/apache-nutch-1.14/bin/nutch inject TestCrawl/crawldb urls/
  Injector: starting at 2018-06-20 07:14:47
  Injector: crawlDb: TestCrawl/crawldb
  Injector: urlDir: urls
  Injector: Converting injected urls to crawl db entries.
  Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:609)
    at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:977)
    at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:187)
    at org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:174)
    at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:108)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:285)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:344)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:115)
    at org.apache.hadoop.mapred.LocalDistributedCacheManager.setup(LocalDistributedCacheManager.java:125)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:163)
    at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:731)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:240)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:417)
    at org.apache.nutch.crawl.Injector.run(Injector.java:563)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.Injector.main(Injector.java:528)
  Error running:
    /home/apache-nutch-1.14/bin/nutch inject TestCrawl/crawldb urls/
  Failed with exit value 1.

After downloading the hadoop-core-1.1.2.jar file from http://www.java2s.com/Code/Jar/h/Downloadhadoopcore121jar.htm and copying it into the NUTCH_HOME/lib folder, I get the following error:

$ bin/crawl -i -D http://localhost:8983/solr/ -s urls/ TestCrawl 2
  Injecting seed URLs
  /home/apache-nutch-1.14/bin/nutch inject TestCrawl/crawldb urls/
  Injector: starting at 2018-06-20 23:19:49
  Injector: crawlDb: TestCrawl/crawldb
  Injector: urlDir: urls
  Injector: Converting injected urls to crawl db entries.
  Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.mapreduce.Job.getInstance(Lorg/apache/hadoop/conf/Configuration;Ljava/lang/String;)Lorg/apache/hadoop/mapreduce/Job;
    at org.apache.nutch.crawl.Injector.inject(Injector.java:401)
    at org.apache.nutch.crawl.Injector.run(Injector.java:563)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.Injector.main(Injector.java:528)
  Error running:
    /home/apache-nutch-1.14/bin/nutch inject TestCrawl/crawldb urls/
  Failed with exit value 1.
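(For context, the NoSuchMethodError above is the classic symptom of mixing Hadoop versions: Nutch runs against whatever Hadoop jars it bundles in its lib folder, so any extra jar or winutils build should match that version. A hypothetical check from NUTCH_HOME:)

ls lib/ | grep hadoop     # lists the Hadoop jars shipped with Nutch; their version is the one to match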

If I don't set the HADOOP_HOME variable at all, I get the following exception instead:

Injector: java.io.IOException: (null) entry in command string: null chmod 0644 C:\cygwin64\home\apache-nutch-1.14\TestCrawl\crawldb\.locked
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:869)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:852)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733)
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:225)
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:209)
    at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:307)
    at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:296)
    at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:328)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.<init>(ChecksumFileSystem.java:398)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:461)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:854)
    at org.apache.hadoop.fs.FileSystem.createNewFile(FileSystem.java:1154)
    at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:59)
    at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:81)
    at org.apache.nutch.crawl.CrawlDb.lock(CrawlDb.java:178)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:398)
    at org.apache.nutch.crawl.Injector.run(Injector.java:563)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.Injector.main(Injector.java:528)

  Error running:
    /home/apache-nutch-1.14/bin/nutch inject TestCrawl//crawldb urls/
  Failed with exit value 127.

I would really appreciate any help I can get!

Scooby-Doo
  • I would be surprised if Nutch supported Hadoop 3.x – OneCricketeer Jun 21 '18 at 03:41
  • In addition, I tried the Hadoop WinUtils version 2.7.1: https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1/bin , with no success – Scooby-Doo Jun 21 '18 at 06:23
  • Which Hadoop version are you actually running? That'll include a Hadoop-core jar file, so no need to download that yourself – OneCricketeer Jun 21 '18 at 14:07
  • I followed the tutorial: https://wiki.apache.org/nutch/NutchTutorial , and it didn't say anything about installing Hadoop. Do you think I need to? And if so, how can I install it properly on Windows? Thanks! – Scooby-Doo Jun 21 '18 at 14:28
  • Well, you said you have HADOOP_HOME variable, which would imply you've downloaded Hadoop binaries, not just winutils – OneCricketeer Jun 21 '18 at 14:30
  • Basically, you're getting `NoSuchMethodError` because some version of everything you've downloaded is expecting one version of Hadoop, but you've given it something else – OneCricketeer Jun 21 '18 at 14:31
  • I manually set the HADOOP_HOME like someone suggested here: https://stackoverflow.com/a/39525952/6667558 – Scooby-Doo Jun 21 '18 at 14:34
  • Nothing here says you need that, though https://wiki.apache.org/nutch/NutchTutorial – OneCricketeer Jun 21 '18 at 14:37
  • In any case, if you have Hadoop 2.7.1 executables, you must use the corresponding JAR files http://central.maven.org/maven2/org/apache/hadoop/hadoop-common/2.7.1/hadoop-common-2.7.1.jar – OneCricketeer Jun 21 '18 at 14:40
  • You are correct, the tutorial doesn't say anything about setting the HADOOP_HOME variable. But if I don't set it, I'm getting a different error. I updated my post and added the new exception I'm getting. I would really appreciate it if you could take a look at it. Thanks! – Scooby-Doo Jun 21 '18 at 14:47

1 Answer


To run the crawl, just execute the following command:

bin/crawl -s urls/ TestCrawl/ 2

Afterwards, you can index into Solr with the following command (passing the Solr URL as a -D property):

bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/YOURCORE TestCrawl/crawldb/ -linkdb TestCrawl/linkdb/ TestCrawl/segments/* -filter -normalize -deleteGone
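Here YOURCORE is a placeholder for the name of your Solr core. If you would rather index during the crawl itself, the same key=value property can presumably be passed to bin/crawl (the original command passed a bare URL to -D, which is not a valid property), for example:

bin/crawl -i -D solr.server.url=http://localhost:8983/solr/YOURCORE -s urls/ TestCrawl/ 2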

Or you can specify the Solr URL in conf/nutch-site.xml:

<property>
    <name>solr.server.url</name>
    <value>http://localhost:8983/solr/YOURCORE/</value>
    <description>Defines the Solr URL into which data should be indexed using the indexer-solr plugin.</description>
</property> 
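With the property set in nutch-site.xml, the -D option can presumably be dropped from the index command, e.g.:

bin/nutch index TestCrawl/crawldb/ -linkdb TestCrawl/linkdb/ TestCrawl/segments/* -filter -normalize -deleteGone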
Quent