
The INJECT step keeps retrieving only a single URL when trying to crawl CNN. I'm running with a mostly default config (my nutch-site.xml is below). What could cause that? Shouldn't it be 10 docs according to my generate.max.count value?

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>crawler1</value>
  </property>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
  </property>
  <property>
        <name>solr.server.url</name>
        <value>http://x.x.x.x:8983/solr/collection1</value>
  </property>
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
</property>
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
<property>
  <name>generate.max.count</name>
  <value>10</value>
</property>
</configuration>
user1025852

1 Answer


A Nutch crawl consists of four basic steps: Generate, Fetch, Parse, and Update DB. These steps are the same for both Nutch 1.x and Nutch 2.x. Executing and completing all four steps makes up one crawl cycle.
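For reference, one cycle looks roughly like this on the command line. This is a minimal sketch assuming Nutch 2.x sub-commands; the crawl id cnn, the -topN value, and the exact flag spellings are assumptions to verify against the usage output of bin/nutch.

bin/nutch generate -topN 1000 -crawlId cnn   # Generate: select URLs for this batch
bin/nutch fetch -all -crawlId cnn            # Fetch: download the generated batch
bin/nutch parse -all -crawlId cnn            # Parse: extract text and outlinks
bin/nutch updatedb -all -crawlId cnn         # Update DB: merge new outlinks into the webtable

If your release also ships the bin/crawl helper script, it wraps these steps and takes the number of rounds as its last argument.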

The Injector is the very first step; it adds the seed URL(s) to the crawldb, as stated here and here.

To populate initial rows for the webtable you can use the InjectorJob.

I reckon you have already done this, i.e. injected cnn.com.
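For completeness, seeding typically looks something like the sketch below; the seed directory urls/ and the crawl id cnn are assumptions for illustration, only the InjectorJob itself comes from the docs quoted above.

mkdir -p urls
echo "http://www.cnn.com/" > urls/seed.txt    # one seed URL per line
bin/nutch inject urls -crawlId cnn            # runs the InjectorJob on the seed list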

generate.max.count limits the number of URLs to be fetched from a single domain, as stated here.

Now what matters is how many URLs from cnn.com your crawldb has.
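If you want to check that, you can inspect the webtable. A sketch, assuming the readdb sub-command and its -stats and -dump options are available in your 2.x release (worth verifying with bin/nutch readdb):

bin/nutch readdb -crawlId cnn -stats                 # prints the total URL count and status breakdown
bin/nutch readdb -crawlId cnn -dump crawldb_dump     # dumps the entries for closer inspection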

Option 1

If you have generate.max.count = 10 and have seeded or injected more than 10 URLs into the crawldb, then on executing a crawl cycle Nutch should fetch no more than 10 URLs.

Option 2

If you have injected only one URL and have performed only one crawl cycle, then on the first cycle you will get only one document processed, because only one URL was in your crawldb. Your crawldb is updated at the end of each crawl cycle, so on the second crawl cycle, the third crawl cycle, and so on, Nutch should resolve up to 10 URLs from a specific domain.
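In practice that means injecting once and then repeating the cycle. A sketch, with the same assumed flags, seed directory, and crawl id as above:

bin/nutch inject urls -crawlId cnn                    # seed once
for i in 1 2 3; do                                    # each pass discovers more cnn.com outlinks,
    bin/nutch generate -topN 1000 -crawlId cnn        # capped per domain by generate.max.count
    bin/nutch fetch -all -crawlId cnn
    bin/nutch parse -all -crawlId cnn
    bin/nutch updatedb -all -crawlId cnn
done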

m5khan
  • thanks, you are right. I now understand that I have to do a couple of cycles - thanks for the really useful links! – user1025852 May 22 '16 at 05:12
  • related question - let's say I want to scan for "new content" every hour (for example from the CNN\Politics section). Currently the behavior is to fetch more from the existing URLs, which leads me to more and more old articles. Is there a way to "purge" the inner DB every iteration and always start from CNN\Politics, for example? – user1025852 May 22 '16 at 18:08
  • To start a crawl from a specific page, you should "inject" that URL, i.e. cnn.com/politics. You can also edit your regex-urlfilter.txt file to specify which URLs to hit. If you want to forget old information, change/remove the directory that holds your crawl information, i.e. the crawldb, segments, linkdb etc. you provide in your crawl commands to Nutch (see the sketch after these comments for the HBase-backed 2.x case) – m5khan May 23 '16 at 06:40
  • thanks! I'm using Nutch through the REST API, so I'll try to see if I can do what you've suggested in a programmatic way :) – user1025852 May 23 '16 at 11:23
  • follow-up question - if I use a different crawl-id, will it "re-create" a new space for the relevant jobs? (new DB etc.) – user1025852 May 23 '16 at 19:10
  • Check out this link: https://wiki.apache.org/nutch/bin/nutch%20generate Though I have not explored Nutch 2.x, I reckon it will assign a new identifier and cause "re-creation" of a new space. It seems similar to Nutch 1.x. You can try it yourself by changing the id, and it would be great if you shared the results with me :) – m5khan May 24 '16 at 06:26
  • still investigating.. it seems that after I delete the folder in /tmp/hadoop- a new "space" is created. I hope I examined that right – user1025852 May 26 '16 at 08:23
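Regarding the purge and crawl-id questions in the comments: with the HBaseStore backend configured in the question, Nutch 2.x keeps crawl data in an HBase table rather than in crawldb/segments/linkdb directories, and (an assumption worth verifying against your setup) each crawl id gets its own table named <crawlId>_webpage, so switching ids gives you a fresh space and truncating the table resets an existing one. A rough sketch:

bin/nutch inject urls -crawlId politics              # assumed to create/use the table politics_webpage
echo "truncate 'politics_webpage'" | hbase shell     # wipe that crawl's state from HBase before a fresh run

The crawl id politics and the table-naming convention are assumptions for illustration; check the table list in the hbase shell before truncating anything.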