Questions tagged [nutch]

Nutch is a mature, production-ready web crawler. It enables fine-grained configuration and relies on Apache Hadoop™ data structures, which are well suited to batch processing.

Nutch is open-source web-search software. It builds on Hadoop, adding web-specific components such as a crawler, a link-graph database, and parsers for HTML and other document formats.

Nutch can run on a single machine, but it gains much of its strength from running in a Hadoop cluster.

The system can be extended (e.g. to parse other document formats or extract custom information) using a plugin mechanism.

For more information about Nutch, please see the Nutch wiki.

Nutch has a mailing list where users can post questions and developers can respond. Sometimes it is faster to get a reply there.

How Nutch Works

1571 questions
20
votes
5 answers

An alternative web crawler to Nutch

I'm trying to build a specialised search engine web site that indexes a limited number of web sites. The solution I came up with is: using Nutch as the web crawler, using Solr as the search engine, the front-end and the site logic is coded with…
wassimans
  • 8,382
  • 10
  • 47
  • 58
18
votes
3 answers

How to crawl a website that has SAML authentication using ManifoldCF or nutch?

I am trying to crawl a website, more specifically a Google Site that has SAML authentication, using ManifoldCF, and index the crawled data into Apache Solr. But as I crawl the URL, it gives me a 302 redirect to the login page and then says…
Saurabh Chaturvedi
  • 2,028
  • 2
  • 18
  • 39
16
votes
3 answers

no segments* file found

I need to access a Lucene index (created by crawling several webpages using Nutch) but it is giving the error shown above: java.io.FileNotFoundException: no segments* file found in org.apache.lucene.store.FSDirectory@/home/: files: at…
crazyaboutliv
  • 3,029
  • 9
  • 33
  • 50
15
votes
2 answers

Insufficient space for shared memory file when I try to run nutch generate command

I have been running Nutch crawling commands for the past 3 weeks and now I get the error below when I try to run any Nutch command: Java HotSpot(TM) 64-Bit Server VM warning: Insufficient space for shared memory file: …
peter
  • 3,411
  • 5
  • 24
  • 27
14
votes
3 answers

How is an aggregator built?

Let's say I want to aggregate information related to a specific niche from many sources (could be travel, technology, or whatever). How would I do that? Have a spider/crawler that will crawl the web to find the information I need (how would I…
Mircea
13
votes
3 answers

Using Nutch crawler with Solr

Am I able to integrate the Apache Nutch crawler with the Solr index server? Edit: One of our devs came up with a solution from these posts: "Running Nutch and Solr" and "Update for Running Nutch and Solr". Answer: Yes.
Scott Cowan
  • 2,652
  • 7
  • 29
  • 45
12
votes
2 answers

Nutch No agents listed in 'http.agent.name'

Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property. at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1166) at…
LinuxBill
  • 415
  • 1
  • 8
  • 19
11
votes
1 answer

Apache Nutch and Solr integration

I've tried to follow the Nutch tutorial but I'm having a bit of a problem with the schema.xml file. I was told to copy the Nutch-provided schema to my project, essentially this... cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml…
Carlton
  • 5,533
  • 4
  • 54
  • 73
10
votes
2 answers

get out links from nutch

I am using Nutch 1.3 to crawl a website. I want to get a list of the URLs crawled, and the URLs originating from a page. I get the list of crawled URLs using the readdb command: bin/nutch readdb crawl/crawldb -dump file. Is there a way to find out URLs that are on…
surajz
  • 3,471
  • 3
  • 32
  • 38
10
votes
1 answer

How to run apache nutch different jobs in parallel manner

I am using Nutch 2.3. All jobs run one after the other, i.e. first generate, fetch, parse, index, etc. I want to run some jobs simultaneously. I know some jobs cannot run in parallel but others can, e.g. the parse job, dbupdate, indexjob should be run with…
Hafiz Muhammad Shafiq
  • 8,168
  • 12
  • 63
  • 121
10
votes
3 answers

Recrawl URL with Nutch just for updated sites

I crawled one URL with Nutch 2.1 and then I want to re-crawl pages after they got updated. How can I do this? How can I know that a page is updated?
Ilce MKD
  • 245
  • 3
  • 7
10
votes
1 answer

How to extend Nutch for article crawling

I'm looking for a framework to grab articles, and I found Nutch 2.1. Here's my plan, with questions on each step: 1. Add article list pages into url/seed.txt. Here's one problem: what I actually want to be indexed is the article pages, not the article list…
user1633272
  • 2,007
  • 5
  • 25
  • 48
10
votes
2 answers

nutch vs solr indexing

I have recently started working on Nutch and I am trying to understand how it works. As far as I know, Nutch is basically used to crawl the web and Solr/Lucene is used to index and search. But when I read the documentation on Nutch, it says that Nutch…
CRS
  • 471
  • 9
  • 23
9
votes
1 answer

Apache Nutch - Problems with Paths

I am trying to set up Apache Nutch to crawl URLs, following this guide. Being an older guide (The guide is for 1.x, I am using 2.3), I have made the necessary changes to structure. However, when I try to run a crawl, I get this error…
Sainath Krishnan
  • 2,089
  • 7
  • 28
  • 43
9
votes
1 answer

Nutch versus Solr

I am currently collecting information on where I should use Nutch with Solr (domain: vertical web search). What would you suggest?
Jeriho
  • 7,129
  • 9
  • 41
  • 57