Questions tagged [nutch]

Nutch is a mature, production-ready web crawler. It enables fine-grained configuration and relies on Apache Hadoop™ data structures, which are well suited to batch processing.

Nutch is open-source web-search software. It builds on Hadoop, adding web-specific components such as a crawler, a link-graph database, and parsers for HTML and other document formats.

Nutch can run on a single machine, but gains much of its strength from running in a Hadoop cluster.

The system can be extended (e.g., to parse other document formats or extract custom information) using a plugin mechanism.

For more information about Nutch, please see the Nutch wiki.

Nutch has a mailing list, a place where users can post questions and developers can respond. Sometimes, it is faster to get a reply over there.

1571 questions

5 answers

An alternative web crawler to Nutch

I'm trying to build a specialised search engine web site that indexes a limited number of web sites. The solution I came up with is: using Nutch as the web crawler, using Solr as the search engine, the front-end and the site logic is coded with…

search-engine web-crawler nutch

asked Nov 24 '10 at 17:24

wassimans

3 answers

How to crawl a website that has SAML authentication using ManifoldCF or nutch?

I am trying to crawl a website, more specifically a Google Site that has SAML authentication, using ManifoldCF, and index the crawled data into Apache Solr. But as I crawl the URL, it gives me a 302 redirect to the login page and then says…

solr saml nutch full-text-indexing manifoldcf

asked Aug 08 '16 at 14:07

Saurabh Chaturvedi

3 answers

no segments* file found

I need to access a Lucene index (created by crawling several webpages using Nutch), but it gives the error shown above: java.io.FileNotFoundException: no segments* file found in org.apache.lucene.store.FSDirectory@/home/: files: at…

java lucene nutch

asked Sep 27 '10 at 08:06

crazyaboutliv

2 answers

Insufficient space for shared memory file when I try to run nutch generate command

I have been running Nutch crawling commands for the past 3 weeks, and now I get the error below when I try to run any Nutch command: Java HotSpot(TM) 64-Bit Server VM warning: Insufficient space for shared memory file: …

java jvm nutch

asked Jan 12 '13 at 05:19

peter

3 answers

How is an aggregator built?

Let's say I want to aggregate information related to a specific niche from many sources (could be travel, technology, or whatever). How would I do that? Have a spider/crawler that will crawl the web to find the information I need (how would I…

web-services aggregation web-crawler nutch

asked May 29 '09 at 22:36

Mircea

3 answers

Using Nutch crawler with Solr

Am I able to integrate the Apache Nutch crawler with the Solr index server? Edit: One of our devs came up with a solution from these posts: "Running Nutch and Solr" and "Update for Running Nutch and Solr". Answer: Yes

lucene solr nutch

asked Oct 17 '08 at 08:32

Scott Cowan

2 answers

Nutch No agents listed in 'http.agent.name'

Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property. at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1166) at…

web-crawler nutch

asked Jul 05 '11 at 12:51

LinuxBill

1 answer

Apache Nutch and Solr integration

I've tried to follow the Nutch tutorial but am having a bit of a problem with the schema.xml file. I was told to copy the Nutch-provided schema to my project, essentially this... cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml…

linux solr lucene nutch

asked Apr 11 '13 at 10:02

Carlton

2 answers

get out links from nutch

I am using Nutch 1.3 to crawl a website. I want to get a list of the URLs crawled and the URLs originating from a page. I get the list of crawled URLs using the readdb command: bin/nutch readdb crawl/crawldb -dump file. Is there a way to find out urls that are on…

web-crawler nutch

asked Sep 15 '11 at 02:13

surajz

1 answer

How to run apache nutch different jobs in parallel manner

I am using Nutch 2.3. All jobs run one after another, i.e. first generator, then fetch, parse, index, etc. I want to run some jobs simultaneously. I know some jobs cannot run in parallel, but others can, e.g. parse job, dbupdate, indexjob should be run with…

java apache web-crawler nutch

asked May 05 '15 at 06:35

Hafiz Muhammad Shafiq

3 answers

Recrawl URL with Nutch just for updated sites

I crawled one URL with Nutch 2.1, and now I want to re-crawl pages after they have been updated. How can I do this? How can I know that a page has been updated?

apache solr lucene nutch web-crawler

asked Jan 10 '13 at 15:40

Ilce MKD

1 answer

How to extend Nutch for article crawling

I'm looking for a framework to grab articles, and I found Nutch 2.1. Here's my plan and my questions about each step: 1 Add article list pages into url/seed.txt Here's one problem. What I actually want to be indexed is the article pages, not the article list…

web-crawler nutch

asked Dec 15 '12 at 15:13

user1633272

2 answers

nutch vs solr indexing

I have recently started working with Nutch and I am trying to understand how it works. As far as I know, Nutch is basically used to crawl the web, and Solr/Lucene is used to index and search. But when I read the documentation on Nutch, it says that Nutch…

solr lucene nutch

asked Jun 01 '12 at 05:18

CRS

1 answer

Apache Nutch - Problems with Paths

I am trying to set up Apache Nutch to crawl URLs, following this guide. Because it is an older guide (it is for 1.x, and I am using 2.3), I have made the necessary changes to the structure. However, when I try to run a crawl, I get this error…

java apache nutch

asked Nov 15 '15 at 08:50

Sainath Krishnan

1 answer

Nutch versus Solr

I am currently collecting information on whether I should use Nutch with Solr (domain: vertical web search). What would you suggest?

solr nutch

asked May 12 '10 at 11:00

Jeriho
