
I'm running Apache Nutch, which seems to work: in small runs it will index documents and commit to Solr at the end of the run.

Unfortunately, I want to index deep within some large sites, and Nutch won't commit to Solr until the end of a run.

This causes obvious issues when 100k+ documents are stacked up waiting to commit: pressure on memory, having to wait a long time for the data, and so on.

Is there a way to persuade Nutch to commit more frequently?

rich

1 Answer


There is a configuration parameter in Nutch named `solr.commit.size`, which according to its description in nutch-default.xml is:

Defines the number of documents to send to Solr in a single update batch. Decrease when handling very large documents to prevent Nutch from running out of memory. NOTE: It does not explicitly trigger a server side commit.
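You can override the default in your nutch-site.xml. A minimal sketch (the value shown is only an illustration; pick one that suits your document sizes):

```xml
<!-- nutch-site.xml: overrides the default from nutch-default.xml -->
<property>
  <name>solr.commit.size</name>
  <!-- illustrative value: number of documents per update batch sent to Solr -->
  <value>100</value>
  <description>Number of documents to send to Solr in a single update batch.</description>
</property>
```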

As the description says, it does not explicitly trigger a commit, because it is better to leave the decision about when to commit to Solr itself. So you should also tune Solr's configuration parameters autoCommit and autoSoftCommit; you can find their descriptions in the solrconfig.xml file.
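As a rough sketch, the relevant section of solrconfig.xml looks like the following (the thresholds are assumptions to tune for your own load, not recommended values):

```xml
<!-- solrconfig.xml, inside <updateHandler>: illustrative values only -->
<autoCommit>
  <!-- hard commit: flush to stable storage every 60s or every 10,000 docs -->
  <maxTime>60000</maxTime>
  <maxDocs>10000</maxDocs>
  <!-- avoid reopening a searcher on every hard commit -->
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <!-- soft commit: make newly indexed documents searchable roughly every 15s -->
  <maxTime>15000</maxTime>
</autoSoftCommit>
```

With settings along these lines, Solr commits on its own schedule regardless of how Nutch batches its updates, which is what makes documents visible during a long crawl.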

tahagh
  • I agree with the latter part: configure Solr's `autoCommit` and probably `autoSoftCommit` if you want to search in between. Further reading on this: http://stackoverflow.com/questions/15667748/how-to-configure-solr-for-improved-indexing-speed and http://t.co/2I9cmHli3H – cheffe Jan 05 '14 at 11:49
  • I've tried these, but see no signs of Nutch committing to Solr. The default for solr.commit.size seems to be 250, but I'm getting through over 100k documents without any Solr logs triggered by Nutch, or Nutch showing the details of Solr commits in its log. – rich Jan 07 '14 at 17:55
  • With the default settings, if you use the crawl command, it will fetch 100K documents in each iteration during the fetch step, but at the index step it will send the documents to Solr in batches of 250. So I think the 100k documents you mentioned are those that have been fetched and stored in the segment but not yet indexed. – tahagh Jan 08 '14 at 05:23
  • I can see now that at the end of the crawl the solr.commit.size controls the batches pushed into solr. Getting it to commit during the crawl is the issue though... – rich Jan 08 '14 at 22:54