
I have an application that creates a rather large Solr 3.6 index each day: approximately 300GB and 1B documents, divided into 10 cores. Indexing works great, and I’m using a round-robin algorithm to distribute the docs evenly between the cores. Searches work great for me too, until the returned result set grows beyond roughly 100K documents.

At that point, I get a Java error back: either an OutOfMemoryError or a SolrException: parsing error.

My searches are simple (no wildcards, sorting, or faceting), yet Solr seems to buffer the entire result set before returning it. The physical memory on my server is 256GB and I am running Solaris 10. I’m using the default 32-bit Java, but have also tried Java 7 in both 32-bit and 64-bit.

When I use 64-bit Java, I am able to increase the max heap with the -Xmx option enough to return 1M+ documents, but it requires practically all the memory I have for just a single Solr process.

Other than re-designing my application with hundreds of tiny indexes, does anyone have any suggestions on how to get large search result sets out of Solr without huge amounts of RAM?

asked by scottw (edited by Lukas Knuth)
  • How big are your documents? What are you indexing and what are you storing? What do you mean by "divided into 10 cores?" – Diego Basch Dec 26 '12 at 22:31
  • With the 32-bit JVM you are going to be limited to around 3 GB of heap, so if this indeed requires more than that, you absolutely are going to need 64 bits. That seems self-evident based on your ability to get things running, albeit by giving up nearly your entire machine's RAM. See this question for some advice on how to dig into memory allocation analysis: http://stackoverflow.com/questions/1839599/analyze-gc-logs-for-sun-hotspots-jvm-6/1841109#1841109 – gview Dec 26 '12 at 22:34
  • I think that's a 2GB limit on 32-bit JVM: http://stackoverflow.com/questions/2457514/understanding-max-jvm-heap-size-32bit-vs-64bit – duffymo Dec 26 '12 at 22:39
  • I am indexing invoice type documents that are usually just under 1k each. When I say divided, I'm talking about multi-core. I have 10 cores configured per Solr process. There is 1 Solr process per day, allowing me to further divide the processing for search and index, and also making it easy to age off the indexes. – scottw Dec 26 '12 at 23:02
  • Why does Solr need to buffer my query results before returning them? I'm not doing any sorting or faceted searching. I can understand using a buffered reader to read from the disk, but it shouldn't require it to buffer the entire result set, should it? – scottw Dec 26 '12 at 23:23
  • Have you tried returning the docs as JSON? We had strange memory issues as well, but we switched to wt=json instead of xml and so far we can return roughly 3-4 million docs' worth before crashing Solr. – Henrik Andersson Dec 26 '12 at 23:44
  • I have not tried json, but I did change wt=csv, which did improve the memory usage over xml. The results I listed above are using wt=csv. – scottw Dec 26 '12 at 23:58
  • This is not a programming question. You'd do better on the Solr user mailing list. – bmargulies Dec 27 '12 at 00:21
  • I am open to programmatic, configuration, or other solutions, which is why I thought it appropriate to post here. If anybody has a good Lucene based solution, please share. Thanks. – scottw Dec 27 '12 at 15:23

1 Answer


You can try disabling the various Solr caches (filterCache, queryResultCache, and documentCache). This will likely hurt performance but might give you some breathing space.
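
For instance, here is a sketch of what that might look like; the cache classes below follow the stock example solrconfig.xml and may differ from yours, and commenting the entries out entirely is an equivalent way to disable them:

```xml
<!-- Inside the <query> section of solrconfig.xml. Setting size="0"
     (or commenting these entries out) disables each cache. -->
<query>
  <filterCache      class="solr.FastLRUCache" size="0" initialSize="0" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache"     size="0" initialSize="0" autowarmCount="0"/>
  <documentCache    class="solr.LRUCache"     size="0" initialSize="0" autowarmCount="0"/>
</query>
```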

If your Solr HTTP/XML responses are big, you can consider running Solr embedded in the same JVM as your application, or even using raw Lucene, to save on the XML serialization overhead.
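
One way to run Solr in the same JVM is SolrJ's EmbeddedSolrServer. A minimal sketch, assuming SolrJ 3.6 on the classpath; the Solr home path, core name, query, and field name are placeholders rather than anything from the question:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.core.CoreContainer;

public class EmbeddedQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder Solr home and core name.
        System.setProperty("solr.solr.home", "/path/to/solr/home");
        CoreContainer.Initializer initializer = new CoreContainer.Initializer();
        CoreContainer container = initializer.initialize();
        EmbeddedSolrServer server = new EmbeddedSolrServer(container, "core0");

        // Placeholder query; ask for a large page of results.
        SolrQuery query = new SolrQuery("type:invoice");
        query.setRows(100000);

        // No HTTP transport and no XML/CSV serialization, but Solr still
        // materializes the whole SolrDocumentList in memory before returning.
        QueryResponse rsp = server.query(query);
        for (SolrDocument doc : rsp.getResults()) {
            System.out.println(doc.getFieldValue("id"));
        }

        container.shutdown();
    }
}
```

As the comment notes, this only removes the transport and serialization cost; for a result set that truly does not fit in the heap, the raw-Lucene approach (sketched after the comments below) is the one that avoids buffering the documents.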

Other than that I'm afraid you will need to look into sharding.

answered by mindas
  • I will look into adjusting or turning off these Cache parameters and let you know if/how much it improves the memory utilization in my case. Thanks. – scottw Dec 27 '12 at 00:20
  • I have tested turning off each of the cache parameters you suggested, including some additional parameters from the SolrCaching wiki, but unfortunately the results were not good. None of the parameters affected the memory utilization, as I observed using prstat. Several of the parameters did affect the performance negatively, as you predicted. Do you have any insight into why Solr buffers the entire result before writing to output, even in simple search scenarios? – scottw Dec 27 '12 at 20:10
  • I am more of a Lucene than a Solr guy, so I wouldn't know much about Solr internals. In the Lucene world, though, a search result only requires a very limited amount of RAM: an [array of rather inexpensive objects](http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/TopDocs.html). I'd guess Solr probably iterates over the whole result set and loads every document. Instead you could load documents one by one and let GC do its job (see the sketch after these comments). You could also try running your index with [Luke](http://code.google.com/p/luke/): just do the same search and see how much memory is used. – mindas Dec 27 '12 at 21:38
  • Thank you. I will take a look at Luke. If you know of any good code examples of searching with Lucene without buffering the results, please provide. I’ll take a crack at rolling my own using Lucene rather than further sharding my application. – scottw Dec 27 '12 at 22:56
  • I highly recommend [Lucene in Action](http://www.manning.com/hatcher2/), it's got all the examples and much more. Written by authors of Lucene/Solr. – mindas Dec 28 '12 at 09:39
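
For reference, here is a minimal sketch of that load-one-document-at-a-time approach, assuming Lucene 3.6 (matching the Solr 3.6 index) and placeholder index path, query, and field names:

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class StreamResults {
    public static void main(String[] args) throws Exception {
        // Placeholder path: point this at one core's data/index directory.
        IndexReader reader = IndexReader.open(
                FSDirectory.open(new File("/path/to/core/data/index")));
        IndexSearcher searcher = new IndexSearcher(reader);

        // Placeholder field and query.
        QueryParser parser = new QueryParser(Version.LUCENE_36, "body",
                new StandardAnalyzer(Version.LUCENE_36));
        Query query = parser.parse("invoice");

        // TopDocs holds only doc ids and scores, so even a very large hit
        // count stays cheap in memory.
        TopDocs hits = searcher.search(query, 1000000);

        for (ScoreDoc sd : hits.scoreDocs) {
            // Load one stored document at a time, write it out, and let it
            // become garbage before fetching the next one.
            Document doc = searcher.doc(sd.doc);
            System.out.println(doc.get("id"));
        }

        searcher.close();
        reader.close();
    }
}
```

Note that searcher.search(query, n) pre-allocates space for n hits up front, so choose n close to the number of hits you actually expect rather than an arbitrary huge ceiling.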