
I'm implementing a hash table that stores pairs of 32-bit values. Since my elements are fixed-size, I'm using a very simple hash function:

`hash(a, b) = asUint64(a) + (asUint64(b) << 32)`

With that, I get the index of an element in a hash table (that is, its corresponding bucket) with:

`index(a, b) = hash(a, b) % hash_size`

Where hash_size is the number of entries/buckets in my table. I've realized, though, that I could speed this implementation up a little bit if I fixed hash_size as a power of 2 and replaced the modulus operator with a bitwise AND against `hash_size - 1`. Except, when I do that, most of my pairs end up in the first bucket! Why is that happening?
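A minimal sketch of what I mean (in Python, with masking to emulate 64-bit unsigned arithmetic; names are mine):

```python
def hash64(a, b):
    # asUint64(a) + (asUint64(b) << 32): a in the low 32 bits, b in the high 32
    return ((a & 0xFFFFFFFF) + ((b & 0xFFFFFFFF) << 32)) & 0xFFFFFFFFFFFFFFFF

hash_size = 1 << 16  # a power of two
h = hash64(12345, 67890)
# the replacement in question: for power-of-two sizes,
# h % hash_size is the same as h & (hash_size - 1)
assert h % hash_size == h & (hash_size - 1)
```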

MaiaVictor
  • if you mod by power of 2, you are destroying a lot of the data that is present in `b`, and possibly `a`. – Ben Oct 17 '14 at 21:17
  • I guess that sounds reasonable, but why? – MaiaVictor Oct 17 '14 at 21:25
  • It is odd that you say most of your pairs end up on the first bucket. I expect more collisions with this scheme, but I suspect your case is more due to the distribution of your data. – Ben Oct 17 '14 at 21:42
  • 1
    Rehash your hash with [MurmurHash3](https://code.google.com/p/smhasher/wiki/MurmurHash3) avalanche function and continue using power of 2 size. – leventov Oct 17 '14 at 23:52
  • 1
    Also you can find [this answer](http://stackoverflow.com/a/26353689/648955) useful – leventov Oct 17 '14 at 23:53
  • Read about [Bézout's identity](https://en.wikipedia.org/wiki/B%C3%A9zout%27s_identity). – Basile Starynkevitch Nov 02 '14 at 08:08
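The avalanche step leventov suggests can be sketched as follows (this is the 64-bit finalizer from MurmurHash3; the constants come from the public reference implementation, and the rest of the example is my own illustration):

```python
def fmix64(h):
    # MurmurHash3's 64-bit finalizer: mixes every input bit into every
    # output bit, so the low bits kept by a power-of-two mask depend on
    # the high half of the hash as well as the low half
    h ^= h >> 33
    h = (h * 0xFF51AFD7ED558CCD) & 0xFFFFFFFFFFFFFFFF
    h ^= h >> 33
    h = (h * 0xC4CEB9FE1A85EC53) & 0xFFFFFFFFFFFFFFFF
    h ^= h >> 33
    return h

# keys that differ only in the high 32 bits now typically spread
# across buckets instead of all mapping to the same one
buckets = {fmix64(42 + (b << 32)) & 1023 for b in range(100)}
assert len(buckets) > 1
```

Since every step (XOR-shift, multiply by an odd constant mod 2^64) is invertible, fmix64 is a bijection: it never introduces new collisions, it only redistributes the bits.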

1 Answer


My guess is that your data is not evenly distributed in a. Consider the concatenation of a and b as your hash code:

b₃₁b₃₀…b₁b₀a₃₁a₃₀…a₁a₀, where aᵢ and bᵢ are the i-th bits of a and b

Suppose you have a table with 2^20 (about a million) entries; your hash index is then

a₁₉a₁₈…a₁a₀ (as an integer)

Worse, suppose a only ever ranges from 1 to 100. Then only the low 7 bits of a ever vary, so all of your pairs crowd into the first 100 buckets, which matches the clustering you describe.

As you can see, unless your hash table has at least 2^32 (about 4 billion) entries, your hash index has no dependence on b at all, and hash(a, x) will collide for all x.
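Concretely (a small Python sketch of the scheme above; the names are mine):

```python
def hash64(a, b):
    # the question's hash: a in the low 32 bits, b in the high 32
    return ((a & 0xFFFFFFFF) + ((b & 0xFFFFFFFF) << 32)) & 0xFFFFFFFFFFFFFFFF

hash_size = 1 << 20  # about a million buckets, far fewer than 2**32
# b occupies bits 32..63, but modding by hash_size keeps only bits 0..19,
# so every pair (42, b) lands in bucket 42 no matter what b is
assert {hash64(42, b) % hash_size for b in range(10000)} == {42}
```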

Ben
  • That is enlightening, thanks. This is not in the scope of the question, so feel free not to answer, but what hashing function would you recommend in this case? – MaiaVictor Oct 17 '14 at 21:53