I was trying to hash about 64 million unique 64-bit unsigned integers into 128 million buckets (a 27-bit wide address). I tried Bob Jenkins's hashlittle and MurmurHash (both give 32-bit hashes, which I masked to obtain a 27-bit address). In both cases this resulted in about 22% collisions, and in the end only 37% of the buckets were occupied. Is this expected, or am I doing something wrong? I was expecting far fewer collisions and better occupation of the buckets.
-
Instead of masking - by which I assume you mean AND'ing with 0x7FFFFFF - try taking the modulus of the greatest prime <= 0x7FFFFFF. – 500 - Internal Server Error Aug 09 '14 at 18:00
-
Most renowned hash functions are designed for hashing short to moderately long strings. For integer data, the value itself is often the best hash, and for reducing 64 bits to 32 bits you'd use a mixing function otherwise used for aggregating hash values (a sketch of such a mixing step appears after these comments). – Aug 09 '14 at 18:02
-
Might be worth playing with some of the hash functions in http://stackoverflow.com/questions/664014/what-integer-hash-function-are-good-that-accepts-an-integer-hash-key? – NPE Aug 09 '14 at 18:11
-
If you're on Intel, something based on CRC-32 might be worth a try: `_mm_crc32_u32()`. See http://stackoverflow.com/a/3045334/2809095. – Iwillnotexist Idonotexist Aug 10 '14 at 04:53
-
Could someone give some references on designing mixing functions? – Jean Aug 24 '14 at 02:03
-
Not sure if this is helpful, but you might want to take a look at [CMPH](http://cmph.sourceforge.net/index.html), a library for constructing minimal perfect hash functions for large sets. – Hasturkun Aug 24 '14 at 09:57
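For the mixing-function suggestion above, here is a minimal sketch (the multiplicative constant and the idea of taking the top 27 bits directly are illustrative choices, not something any commenter specified):

#include <stdint.h>

/* Multiply by a large odd 64-bit constant (the golden-ratio constant used by
   splitmix64) and keep the high bits; taking the top 27 bits yields a bucket
   index directly. This is one common integer mixing step, not the only one. */
static inline uint32_t mix64_to_bucket27(uint64_t x)
{
    return (uint32_t)((x * 0x9E3779B97F4A7C15ULL) >> (64 - 27));
}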
2 Answers
It looks slightly worse than I would expect at random, using an approximation based on the [Poisson distribution](http://en.wikipedia.org/wiki/Poisson_distribution). If the expected number of entries in a bucket is 1/2, I would expect the probability of 0 entries to be about exp(-0.5) = 0.607, and the probability of exactly 1 entry in a bucket to be about half this, or 0.303. That leaves a probability of about 0.09 that a bucket has two or more entries, and it means roughly 39% of buckets would be occupied at random - slightly more than the 37% you observed.
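As a quick sanity check of those figures (this snippet is an illustration of the arithmetic, not part of the answer), they can be computed directly:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double lambda = 64e6 / 128e6;       /* expected entries per bucket = 0.5 */
    double p0 = exp(-lambda);           /* probability a bucket stays empty  */
    double p1 = lambda * exp(-lambda);  /* probability of exactly one entry  */
    double p2plus = 1.0 - p0 - p1;      /* probability of two or more        */

    printf("P(0)=%.3f  P(1)=%.3f  P(>=2)=%.3f  occupied=%.1f%%\n",
           p0, p1, p2plus, 100.0 * (1.0 - p0));
    return 0;
}

This prints P(0)=0.607, P(1)=0.303, P(>=2)=0.090 and an expected occupancy of about 39.3%.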
Are your integers all unique? If not, are you counting duplicate values as causing a hash collision?
In favourable circumstances, you can choose a hash function so as to give FEWER collisions than you would expect at random. Sometimes hash(x) = x % p, where p is a prime, will achieve this (see the sketch below).
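A minimal sketch of that idea, combined with the mod-a-prime suggestion from the comments (the trial-division helper and searching down from 0x7FFFFFF are illustrative choices, not part of the answer):

#include <stdint.h>
#include <stdio.h>

/* Naive primality test by trial division - fine for a one-off search. */
static int is_prime(uint32_t n)
{
    if (n < 2) return 0;
    if (n % 2 == 0) return n == 2;
    for (uint32_t d = 3; (uint64_t)d * d <= n; d += 2)
        if (n % d == 0) return 0;
    return 1;
}

/* Largest prime not exceeding limit. */
static uint32_t largest_prime_at_most(uint32_t limit)
{
    while (!is_prime(limit)) --limit;
    return limit;
}

int main(void)
{
    uint32_t p = largest_prime_at_most(0x7FFFFFF);  /* just under 2^27 buckets */
    uint64_t key = 0x0123456789ABCDEFULL;           /* example 64-bit key      */
    uint32_t bucket = (uint32_t)(key % p);          /* always fits in 27 bits  */
    printf("p = %u, bucket = %u\n", p, bucket);
    return 0;
}

Whether this actually beats a random-looking hash depends entirely on how your keys are distributed - that is what "favourable circumstances" means here.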

If you want to get "random but repeatable" results - which have the best worst-case collision rates even for deliberately difficult inputs* - you can simply create a table like:
uint32_t r[8][256];
Populate it with 8 KB of random data - you can google for a website offering random data to download, then reformat it for inclusion in your source or load it at runtime from a file.
(*) - as long as the inputs aren't created by someone malicious who knows your random data too.
Then hash like this:
uint32_t hash(uint64_t n)
{
    /* XOR together one random 32-bit word per byte of the key (tabulation hashing). */
    unsigned char* p = (unsigned char*)&n;
    return r[0][p[0]] ^ r[1][p[1]] ^ r[2][p[2]] ^ r[3][p[3]] ^
           r[4][p[4]] ^ r[5][p[5]] ^ r[6][p[6]] ^ r[7][p[7]];
}
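Combined with the two snippets above, a minimal sketch of populating the table and reducing the 32-bit result to a 27-bit bucket index (seeding from rand() and masking with 0x7FFFFFF are illustrative choices; the answer itself suggests real downloaded random data):

#include <stdint.h>
#include <stdlib.h>

/* Fill r[8][256] with pseudo-random words. rand() is only a placeholder here;
   genuine random data, as suggested above, is preferable. */
void init_table(unsigned seed)
{
    srand(seed);
    for (int i = 0; i < 8; ++i)
        for (int j = 0; j < 256; ++j)
            r[i][j] = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
}

/* Usage: a 27-bit bucket index for a 64-bit key, using hash() from above. */
uint32_t bucket_for(uint64_t key)
{
    return hash(key) & 0x7FFFFFF;
}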
Of course, better worst-case collision behaviour is often a very different thing from better real-world performance - a lot depends on your data set and hardware - so it's just something to benchmark if you really care. Do benchmark a simple pass-through (using the integer value itself as the hash) as well. Using a prime number of buckets is very good practice but might be tricky depending on your hash table - e.g. some implementations round any resize request up to a power of two.
