26

I've got the following question about choosing hash functions for Bloom filters:

  • Which functions to use?

In nearly every document/paper you can read that the hash functions used in a Bloom filter should be independent and uniformly distributed.

I know what is meant by this (independent and uniformly distributed), but I'm having trouble to find a argumentation or a discussion, which hash functions fulfill those requirements and are therefore suitable. In a lot of posts I've read about suggestions for the usage of the FNV or Murmur hash function, but not why (or at least without a proof) they are suitable.

Thanks in advance!

Thomas Mueller
  • 48,905
  • 14
  • 116
  • 132
Torsten
  • 281
  • 1
  • 4
  • 6
  • Possible duplicate of [Generating Random Hash Functions for LSH Minhash Algorithm](https://stackoverflow.com/questions/24676237/generating-random-hash-functions-for-lsh-minhash-algorithm) – Jim Mischel Jul 24 '18 at 18:02

3 Answers3

25

I asked myself the same question when building a Java Bloom filter library. See the Github readme for a detailed treatment of my analysis of hash functions for Bloom filters.

I looked at the problem from two perspectives:

  • How fast is the computation?
  • How uniform is the output distribution?

Speed can easily be measured by benchmarks on random input. Uniformity is a bit harder and requires some statistics. Using Chi-Square goodness of fit tests I measured how similar the distribution of hash values is to a uniform distribution.

The result is:

  • Use Murmur3 for the best trade-off between speed and uniformity. Do not use Murmur2 as it is not uniform for inputs that change in small increments.
  • Use a cryptographic hash function like SHA-256 for the best uniformity.
  • Apply the Kirsch-Mitzenmacher-Optimization to only compute 2 instead of k hash functions (hash_i = hash1 + i x hash2).

If your implementation is using Java I would recommend using our Bloom filter hash library. It is well documented and thoroughly tested. For the details, including the benchmark results for different hash function and their unformity according to Chi-Square test, see the Github readme of the repo.

Max Hille
  • 661
  • 7
  • 18
DivineTraube
  • 691
  • 5
  • 9
  • 1
    I didn't read [Kirsch-Mitzenmacher-Optimization](https://www.eecs.harvard.edu/~michaelm/postscripts/tr-02-05.pdf) thoroughly but in the paper hash_i = hash1 + i x hash2 % p, where p is a prime, hash1 and hash2 is within range of [0, p-1], and the bitset consists k * p bits. – cyber4ron Jun 06 '17 at 16:28
  • Is the modulus only on the second hash? Also, the Kirsch-Mitzenmacher-Optimization has 2 hashes of equal bits but in this case one is 128 bits and the second is 256 bits. Will that lead to bias? – Terence Chow Mar 08 '19 at 06:04
  • @TerenceChow they used a Parallel Partitioned Bloom Filter with k different hash tables of p elements with associated hash_i function. The total number of bits used is k * p. – ZAB May 22 '20 at 12:02
5

Hash Functions should provide you with graphical proof of why FNV would be a bad choice, and why Murmur2 or one of Bob Jenkins' Hashes would be a good choice.

Russell Stuart
  • 590
  • 5
  • 11
Guy Gordon
  • 740
  • 6
  • 13
0

I think a reasonable option would be multiple CRC hashes. I'm assuming that, if you want multiple n-bit hash values, then for polynomials with Boolean field coefficients, there are multiple prime polynomials of degree n+1. But I don't know of a process for finding these polynomials.

Another possibility would be to use multiple modulo hashes. The size of the Bloom Filter bit array would have to be the maximum modulo value. But I think, for it to work well, the modulus values would have to be product of primes greater than 10, and relatively prime to each other. And the range from the minimum to the maximum modulus value would have to be as small as possible. I don't know of a way to find such values. I have written some open-source C++ code for quick calculation of remainders: https://github.com/wkaras/C-plus-plus-intrusive-container-templates/blob/master/modulus_hash.h

WaltK
  • 724
  • 4
  • 13