
In the Elasticsearch documentation for the cardinality aggregation, under the heading "Pre-computed hashes", I see the following:

> On string fields that have a high cardinality, it might be faster to store the hash of your field values in your index and then run the cardinality aggregation on this field. This can either be done by providing hash values from client-side or by letting Elasticsearch compute hash values for you by using the mapper-murmur3 plugin.
>
> Pre-computing hashes is usually only useful on very large and/or high-cardinality fields as it saves CPU and memory. However, on numeric fields, hashing is very fast and storing the original values requires as much or less memory than storing the hashes. This is also true on low-cardinality string fields, especially given that those have an optimization in order to make sure that hashes are computed at most once per unique value per segment.

I'm curious about the part where it says, "[this can be done] by providing hash values from client-side," because it doesn't elaborate at all on that point, but goes on to discuss numeric fields.

If I wanted to pre-compute hashes on the client, would using something like xxhash and putting the result in an appropriate number field be sufficient? (And, of course, having cardinality target that field.) Or would I need to use another type of field for the hash value?

knpwrs

1 Answer


Pre-computing hashes for high-cardinality string fields speeds up the cardinality aggregation, because the hashes no longer have to be computed at search time. There's no need to do it on numeric fields, though!

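For string fields, the documentation suggests the mapper-murmur3 plugin, which computes the hashes at index time through its own `murmur3` field type. If you compute the hashes yourself instead, they will typically be alphanumeric strings and should be stored in a keyword field (not a numeric field type!), which you then target in your cardinality aggregation.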
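As a minimal sketch of the client-side route, the snippet below hashes a string value and adds the hex digest to the document before indexing. It uses `hashlib.blake2b` from the Python standard library purely as a stand-in (the question mentions xxhash, and any fast hash with a negligible collision rate should serve the same purpose). The field names `user_id` and `user_id_hash`, the helper names, and the sample UUID are all hypothetical.

```python
import hashlib


def hash_value(value: str) -> str:
    """Return a 128-bit BLAKE2b digest of `value` as a 32-character hex string.

    BLAKE2b is an assumption made for this sketch; swap in whichever
    hash function you prefer.
    """
    return hashlib.blake2b(value.encode("utf-8"), digest_size=16).hexdigest()


def with_hash(doc: dict,
              source_field: str = "user_id",       # hypothetical field name
              hash_field: str = "user_id_hash") -> dict:  # hypothetical field name
    """Return a copy of `doc` with the hash of `source_field` stored in `hash_field`."""
    enriched = dict(doc)  # shallow copy, so the caller's dict is not modified
    enriched[hash_field] = hash_value(doc[source_field])
    return enriched


# Example: enrich a document before sending it to Elasticsearch.
doc = with_hash({"user_id": "3f2b8c1e-9d4a-4e6f-8a71-5c0d2e9b7f14"})
print(doc["user_id_hash"])
```

The enriched document can then be indexed as usual, with `user_id_hash` mapped as a keyword field.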

I've personally seen 10x+ improvements when computing the cardinality of high-cardinality string fields with pre-computed hashes. Worth a try!

Val
  • I'm definitely interested in the improvements, but I'm primarily interested in how to create these hashes without the mapper-murmur3 plugin (for portability across different ES Clusters that I can't control, for instance). If I wanted to compute hashes ahead of time what kind of field should I store them in? – knpwrs Aug 25 '22 at 15:10
  • In the end it doesn't really matter which hashing algorithm you pick, just pick one that is fast enough for you and isn't expected to produce any collisions. Simply hash the value of your string field and store the hash in another keyword field, that you will use in your query. There's nothing more to it. – Val Aug 25 '22 at 15:13
  • Do hashes stored in keyword fields perform better than UUID strings in keyword fields? That's the situation I'm looking at: I currently have UUIDs, and after reading the docs I'm wondering whether I should hash them to get more performance. – knpwrs Aug 26 '22 at 16:04
  • It's pretty much the same. The value itself is not used in the cardinality computation; each field's [ordinal value](https://www.linkedin.com/pulse/elasticsearch-understanding-terms-aggregation-mitchell-pottratz/) is used instead. So ES won't see whether it's a UUID or a hash. All that matters are the ordinal values, and they should be pretty similar for a hash and a UUID. – Val Aug 29 '22 at 14:58
  • But whether the field contains a hash, an ID, or a human-readable string, the cardinality aggregation just uses its ordinal value, right? So why would it be faster with a hash? – Cameron Feb 14 '23 at 22:55
  • @Cameron It's mostly useful in the context of the [cardinality aggregation](https://www.elastic.co/guide/en/elasticsearch/reference/8.6/search-aggregations-metrics-cardinality-aggregation.html#_pre_computed_hashes) and only for strings with **high cardinality**; it brings much less value for low-cardinality strings and numbers. Pre-computing hashes for high-cardinality string fields allows you to save CPU and memory at search time. – Val Feb 15 '23 at 13:03
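
Putting the thread's advice together, here is a rough sketch of the two request bodies involved: an index mapping that stores the pre-computed hash in a keyword field, and a cardinality aggregation that targets that field. The bodies are built as plain Python dictionaries so they can be passed to whichever client you use. The field names `user_id` and `user_id_hash` and the aggregation name `unique_values` are hypothetical.

```python
import json


def build_index_body(source_field: str = "user_id",
                     hash_field: str = "user_id_hash") -> dict:
    """Index body mapping both the original value and its hash as keyword fields."""
    return {
        "mappings": {
            "properties": {
                source_field: {"type": "keyword"},
                hash_field: {"type": "keyword"},
            }
        }
    }


def build_cardinality_query(hash_field: str = "user_id_hash",
                            agg_name: str = "unique_values") -> dict:
    """Search body that counts distinct values of the hash field."""
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "aggs": {
            agg_name: {
                "cardinality": {"field": hash_field}
            }
        },
    }


print(json.dumps(build_index_body(), indent=2))
print(json.dumps(build_cardinality_query(), indent=2))
```

Since nothing here depends on the mapper-murmur3 plugin, the same bodies should work on clusters where plugins cannot be installed.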