
I am using Nutch 2.x, so I am trying to use the nutch command with a depth option:

$: nutch inject ./urls/seed.txt -depth 5

After executing this command I get a message like:

Unrecognized arg -depth

When that failed, I tried to use nutch crawl:

$: nutch crawl ./urls/seed.txt -depth 5

This gives an error like:

Command crawl is deprecated, please use bin/crawl instead

So I tried to use the bin/crawl script to crawl the URLs in seed.txt with the depth option, but in that case it asks for Solr, and I am not using Solr.

My question is: how do I crawl a website while specifying a depth?

sachingupta

1 Answer


My question back to you is: what do you want to do by crawling the page without indexing it in SOLR?

Answer to your question:

If you want to use the Nutch crawler and you don't want to index into SOLR, then remove the following piece of code from the crawl script (sketched after the link below):

http://technical-fundas.blogspot.com/2014/07/crawl-your-website-using-nutch-crawler.html
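
For illustration, the SOLR steps inside the 2.x crawl script look roughly like this (a sketch from memory; the exact variable names and lines vary by Nutch version, so locate the solrindex/solrdedup calls in your own copy):

# indexing and dedup steps to comment out if you are not using SOLR
echo "Indexing $BATCH on SOLR index -> $SOLRURL"
__bin_nutch solrindex $SOLRURL -all -crawlId $CRAWL_ID
echo "SOLR dedup -> $SOLRURL"
__bin_nutch solrdedup $SOLRURL

With those calls removed, depth is controlled by the script's last argument (the number of rounds). Assuming the usual 2.x usage of bin/crawl <seedDir> <crawlId> <solrURL> <numberOfRounds>, an invocation like

bin/crawl ./urls/seed.txt myCrawl http://localhost:8983/solr/ 5

crawls 5 rounds deep. Here myCrawl is a placeholder crawl id, and the SOLR URL may still be required as a positional argument even when the indexing steps are commented out.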

Answer to your other question:

How to get the HTML content for all the links that have been crawled by Nutch (check this link):

How to get the html content from nutch

This will definitely resolve your issue.
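
Since Nutch 2.x with the HBase backend stores fetched pages in the webpage table, with the raw content in the f:cnt column, you can also inspect what was fetched directly from the HBase shell. A minimal sketch, assuming a crawl id of myCrawl (so a table named myCrawl_webpage) and remembering that Nutch row keys are reversed URLs:

hbase shell
# show a couple of rows together with their raw content
scan 'myCrawl_webpage', {COLUMNS => ['f:cnt'], LIMIT => 2}
# fetch a single page by its reversed-URL key
get 'myCrawl_webpage', 'com.example.www:http/', {COLUMN => 'f:cnt'}

Both the table name and the row key here are placeholders; check your Gora/HBase configuration for the actual names.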

Jayesh Bhoyar
  • First, thanks for answering. This works, in the sense that it runs fine and shows no errors, but it is still not giving the HTML contents of the outlinks, which means it is not going to the specified depth. Could you help with getting the HTML contents of the outlinks as well? – sachingupta Aug 04 '14 at 13:48
  • What is your problem: 1) not going into the depth, or 2) not getting the HTML content? Your earlier question was about depth, and I hope my solution will crawl the website in depth. – Jayesh Bhoyar Aug 04 '14 at 13:59
  • Okay, I want to get the HTML contents of all the outlinks on a particular page, down to a specified depth, into my HBase database, with URLs as keys and the HTML content in the f:cnt column. Have you got it, or do you want me to elaborate a little more? – sachingupta Aug 04 '14 at 16:31
  • But if you could help me: how can I make Nutch give me the HTML content of the outlinks as well? – sachingupta Aug 05 '14 at 06:12
  • The Nutch crawldb/segmentdb already contains the HTML content of all the outlinks that Nutch has crawled. You need to investigate more into how to get that content into a destination other than SOLR (a sketch follows these comments). By the way, I guess I have answered your original question on how to skip SOLR and crawl the site to N depth. If you accept that as the answer, it will be good for others looking for an answer to the same question. – Jayesh Bhoyar Aug 05 '14 at 06:21
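
As a starting point for getting the content out without SOLR, the 2.x WebTableReader can dump rows, including raw content, to local files. A hedged sketch (verify the flags by running bin/nutch readdb with no arguments in your version):

# dump all crawled pages, including their raw content, to ./dump
bin/nutch readdb -dump ./dump -content -crawlId myCrawl

Here myCrawl is a placeholder for the crawl id used during the crawl; the dumped files can then be parsed to extract the HTML of each outlink.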