1

I have found tens of explanation of the basic idea of LogLog algorithms, but they all lack details about how does hash function result splitting works? I mean using single hash function is not precise while using many function is too expensive. How do they overcome the problem with single hash function?

This answer is the best explanation I've found, but still have no sense for me:

They used one hash but divided it into two parts. One is called a bucket (total number of buckets is 2^x) and another - is basically the same as our hash. It was hard for me to get what was going on, so I will give an example. Assume you have two elements and your hash function which gives values form 0 to 2^10 produced 2 values: 344 and 387. You decided to have 16 buckets. So you have:

0101 011000  bucket 5 will store 1
0110 000011  bucket 6 will store 4

Could you explain example above pls? You should have 16 buckets because you have header of length 4, right? So how you can have 16 buckets with only two hashes? Do we estimate only buckets, right? So the first bucket is of size 1, and the second of size 4, right? How to merge the results?

Community
  • 1
  • 1
VB_
  • 45,112
  • 42
  • 145
  • 293

2 Answers2

2

Hash function splitting: our goal is to use many hyperloglog structures (as an example, let's say 16 hyperloglog structures, each of them using a 64-bit hash function) instead of one, in order to reduce the estimation error. An intuitive approach might be to process each of the inputs in each of these hyperloglog structures. However, in that case we would need to make sure that the hyperloglogs are independent of each other, meaning we would need a set of 16 hash functions which are independent of each other - that's hard to find!.

So we use an alternative approach. Instead of using a family of 64-bit hash functions, we will use 16 separate hyperloglog structures, each using just a 60-bit hash function. How do we do that? Easy, we take our 64-bit hash function and just ignore the first 4 bits, producing a 60-bit hash function. What do we do with the first 4 bits? We use them to choose one of 16 "buckets" (Each "bucket" is just a hyperloglog structure. Note that 2^4 bits=16 buckets). Now each of the inputs is assigned to exactly one of the 16 buckets, where a 60-bit hash function is used to calculate the hyperloglog value. So we have 16 hyperloglog structures, each using a 60-bit hash function. Assuming that we chose a decent hash function (meaning that the first 4 bits are uniformly distributed, and that they aren't correlated with the remaining 60 bits), we now have 16 independent hyperloglog structures. We take an harmonic average of their 16 estimates to get a much less error-prone estimate of the cardinality.

Hope that clears it up!

OronNavon
  • 1,293
  • 8
  • 18
  • thats for your answer! I'm almost there, but still have some doubts. Do you mean we estimate cardinality by taking in account only 16 buckets (hash-outputs)? But what's wrong with using single hash function and taking 16 outputs with max zeros prefix? Does it make an overestimate? So header is needed to choose 16 buckets indepentenly, right? And we make an estimate based on 16 buckets? – VB_ Oct 24 '16 at 06:07
  • 1
    We do use a single hash function, but note that we need the different estimators to produce independent estimates - that's how we reduce the error. If you really want to understand the practical technicalities of HyperLogLog, I highly recommend reading the original paper: http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf – OronNavon Oct 26 '16 at 11:23
1

The original HyperLogLog paper mentioned by OronNavon is quite theoretical. If you are looking for an explanation of the cardinality estimator without the need of complex analysis, you could have a look on the paper I am currently working on: http://oertl.github.io/hyperloglog-sketch-estimation-paper. It also presents a generalization of the original estimator that does not require any special handling for small or large cardinalities.

Community
  • 1
  • 1
otmar
  • 386
  • 1
  • 9