2

My question is pretty similar to this question.
The difference is that I need the least RAM-intensive way to gather information about the distinct values. I DON'T care about the actual counts in this case; I just want to know the possible values for that field.
I'm constantly running out of heap space (30 million+ documents), and there has to be some way or parameter to do this in a memory-saving way.

Marc Seeger

3 Answers

1

Use the StatsComponent to retrieve a list of distinct values for a certain field: https://cwiki.apache.org/confluence/display/solr/The+Stats+Component

Parameter stats.calcdistinct:

If true, distinct values will be calculated and returned as "countDistinct" and "distinctValues" in the response. This calculation may be expensive for some fields, so it is false by default. If you'd only like to return distinct values for specific fields, you can also specify f.<field>.stats.calcdistinct, replacing <field> with your field name, to limit the distinct value calculation to the required field.
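As a rough illustration of the parameter described above, here is a minimal Python sketch that builds such a request and reads the values out of a JSON response. The helper names, the base URL, and the field name `category` are hypothetical, and the response layout (`stats` / `stats_fields` / `distinctValues`) is assumed to match what your Solr version returns with `wt=json`.

```python
from urllib.parse import urlencode


def build_distinct_values_url(base_url, field):
    """Build a select URL asking the Stats Component for distinct values of `field`.

    `base_url` is assumed to point at a Solr core, e.g. http://localhost:8983/solr/mycore
    """
    params = [
        ("q", "*:*"),
        ("rows", "0"),  # no documents are needed, only the stats section
        ("wt", "json"),
        ("stats", "true"),
        ("stats.field", field),
        # Per-field form, so only this field pays for the distinct calculation.
        (f"f.{field}.stats.calcdistinct", "true"),
    ]
    return f"{base_url}/select?{urlencode(params)}"


def extract_distinct_values(response, field):
    """Pull the distinct values out of an already parsed JSON response (assumed layout)."""
    return response["stats"]["stats_fields"][field]["distinctValues"]


if __name__ == "__main__":
    print(build_distinct_values_url("http://localhost:8983/solr/mycore", "category"))
```

Setting `rows=0` keeps Solr from returning any documents, which matches the use case of only wanting the field values.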

To keep the load down, retrieve it as few times as possible and cache the results and only retrieve again when the data has changed.
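One way to follow that advice, sketched in Python under stated assumptions: `DistinctValuesCache` and `index_version` are hypothetical names, and `index_version` stands for whatever change marker your indexing pipeline already has (for example a timestamp of the last commit). The expensive Solr request is only repeated when that marker changes.

```python
class DistinctValuesCache:
    """Cache the distinct values and reload them only when the index has changed."""

    def __init__(self, load_values):
        # `load_values` is a caller-supplied function that performs the expensive query.
        self._load_values = load_values
        self._version = None
        self._values = None

    def get(self, index_version):
        """Return cached values, reloading if `index_version` differs from the cached one."""
        if self._values is None or index_version != self._version:
            self._values = self._load_values()
            self._version = index_version
        return self._values


if __name__ == "__main__":
    cache = DistinctValuesCache(lambda: ["red", "green", "blue"])
    print(cache.get(index_version=1))  # triggers a load
    print(cache.get(index_version=1))  # served from the cache
```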

If your index is slow in general you might want to have a look at the cache configuration and/or give SOLR more RAM (if you have the means).

Originally answered here (by me):

https://stackoverflow.com/a/26714447/621690

Risadinha
  • That option is not available in v3.x. Is there an answer for v3.x? – Scott Chu Nov 05 '15 at 09:22
  • @ScottChu do you mean Solr 1.3.x? That really is old; it's been a long time since I worked with that version. I would think you can achieve it using the terms component even with 1.3, because Luke (the Solr Admin) had this information even then, if I remember correctly. – Risadinha Nov 05 '15 at 11:36
  • No! I mean Solr 3.x. We have an old Solr 3.5 in production. I tried your answer but it doesn't work! – Scott Chu Nov 12 '15 at 02:05
  • Have you tried the different local parameters that are documented on the linked wiki page? They also state "`calcDistinct` - for backwards compatibility, `calcDistinct=true` may be specified as an alias for both `countDistinct=true distinctValues=true`". I'm quite confident you can find a solution with version 3.5. – Risadinha Nov 12 '15 at 12:16
1

If the number of distinct values is high, you will probably need to do facet paging. Use the facet.offset and facet.limit parameters.
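A minimal Python sketch of that paging loop, under a few assumptions: `fetch` is a hypothetical caller-supplied function that sends the parameters to Solr and returns the parsed JSON response, and the facet values are assumed to arrive in Solr's default flat JSON layout (`[value, count, value, count, ...]`). Sorting by index order is a choice made here to keep the pages stable; it is not part of the original answer.

```python
def facet_page_params(field, offset, limit):
    """Query parameters for one page of facet values of `field`."""
    return {
        "q": "*:*",
        "rows": "0",  # only the facet section is of interest
        "wt": "json",
        "facet": "true",
        "facet.field": field,
        "facet.mincount": "1",  # skip terms that no document contains
        "facet.sort": "index",  # stable ordering across pages
        "facet.offset": str(offset),
        "facet.limit": str(limit),
    }


def iter_distinct_values(fetch, field, page_size=1000):
    """Yield every distinct value of `field`, one facet page at a time."""
    offset = 0
    while True:
        response = fetch(facet_page_params(field, offset, page_size))
        flat = response["facet_counts"]["facet_fields"][field]
        values = flat[0::2]  # even positions hold values, odd positions hold counts
        yield from values
        if len(values) < page_size:
            break  # a short page means there is nothing left
        offset += page_size
```

Because the function is a generator, only one page of values is held on the client at a time; the memory cost on the Solr side still depends on the field and the Solr version.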

Pascal Dimassimo
0

I don't know about RAM usage, but you might wanna try field collapsing. You will find the patch for Solr here.

Jem
  • That seems to be relevant only for the result set. I don't let Solr return any rows. I'm only interested in the facet fields. – Marc Seeger Jul 16 '10 at 09:03