
I'm implementing a hash table that stores pairs of 32-bit values. Since my elements are fixed-size, I'm using a very simple hash function:

`hash(a, b) = asUint64(a) + (asUint64(b) << 32)`

With that, I get the index of an element in a hash table (that is, its corresponding bucket) with:

`index(a, b) = hash(a, b) % hash_size`

Where hash_size is the number of entries/buckets in my table. I've realized, though, that I could speed this implementation up a little bit if I fixed hash_size as a power of 2 and replaced the modulus operator with a bitwise AND against `hash_size - 1`. Except, when I do that, most of my pairs end up in the first bucket! Why is that happening?
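A minimal sketch of what I mean (in Python, with masking to emulate 64-bit unsigned arithmetic; names are mine):

```python
def hash64(a, b):
    # asUint64(a) + (asUint64(b) << 32): a in the low 32 bits, b in the high 32
    return ((a & 0xFFFFFFFF) + ((b & 0xFFFFFFFF) << 32)) & 0xFFFFFFFFFFFFFFFF

hash_size = 1 << 16  # a power of two
h = hash64(12345, 67890)
# the replacement in question: for power-of-two sizes,
# h % hash_size is the same as h & (hash_size - 1)
assert h % hash_size == h & (hash_size - 1)
```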

MaiaVictor
  • if you mod by power of 2, you are destroying a lot of the data that is present in `b`, and possibly `a`. – Ben Oct 17 '14 at 21:17
  • I guess that sounds reasonable, but why? – MaiaVictor Oct 17 '14 at 21:25
  • It is odd that you say most of your pairs end up on the first bucket. I expect more collisions with this scheme, but I suspect your case is more due to the distribution of your data. – Ben Oct 17 '14 at 21:42
  • 1
    Rehash your hash with [MurmurHash3](https://code.google.com/p/smhasher/wiki/MurmurHash3) avalanche function and continue using power of 2 size. – leventov Oct 17 '14 at 23:52
  • 1
    Also you can find [this answer](http://stackoverflow.com/a/26353689/648955) useful – leventov Oct 17 '14 at 23:53
  • Read about [Bézout's identity](https://en.wikipedia.org/wiki/B%C3%A9zout%27s_identity). – Basile Starynkevitch Nov 02 '14 at 08:08
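The avalanche step leventov suggests can be sketched as follows (this is the 64-bit finalizer from MurmurHash3; the constants come from the public reference implementation, and the rest of the example is my own illustration):

```python
def fmix64(h):
    # MurmurHash3's 64-bit finalizer: mixes every input bit into every
    # output bit, so the low bits kept by a power-of-two mask depend on
    # the high half of the hash as well as the low half
    h ^= h >> 33
    h = (h * 0xFF51AFD7ED558CCD) & 0xFFFFFFFFFFFFFFFF
    h ^= h >> 33
    h = (h * 0xC4CEB9FE1A85EC53) & 0xFFFFFFFFFFFFFFFF
    h ^= h >> 33
    return h

# keys that differ only in the high 32 bits now typically spread
# across buckets instead of all mapping to the same one
buckets = {fmix64(42 + (b << 32)) & 1023 for b in range(100)}
assert len(buckets) > 1
```

Since every step (XOR-shift, multiply by an odd constant mod 2^64) is invertible, fmix64 is a bijection: it never introduces new collisions, it only redistributes the bits.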

1 Answer


My guess is that your data is not evenly distributed in a. Consider the concatenation of a and b as your hash code:

b₃₁b₃₀…b₁b₀a₃₁a₃₀…a₁a₀, where aᵢ and bᵢ are the i-th bits of a and b

Suppose you have a table with 2^20 (about a million) entries; your hash index is then

a₁₉a₁₈…a₁a₀ (as an integer)

Worse, suppose a only ever ranges from 1 to 100. Then only the low 7 bits of a ever vary, so all of your pairs crowd into the first 100 buckets, which matches the clustering you describe.

As you can see, unless your hash table has at least 2^32 (about 4 billion) entries, your hash index has no dependence on b at all, and hash(a, x) will collide for all x.
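Concretely (a small Python sketch of the scheme above; the names are mine):

```python
def hash64(a, b):
    # the question's hash: a in the low 32 bits, b in the high 32
    return ((a & 0xFFFFFFFF) + ((b & 0xFFFFFFFF) << 32)) & 0xFFFFFFFFFFFFFFFF

hash_size = 1 << 20  # about a million buckets, far fewer than 2**32
# b occupies bits 32..63, but modding by hash_size keeps only bits 0..19,
# so every pair (42, b) lands in bucket 42 no matter what b is
assert {hash64(42, b) % hash_size for b in range(10000)} == {42}
```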

Ben
  • That is enlightening, thanks. This is not in the scope of the question, so feel free not to answer, but what hashing function would you recommend in this case? – MaiaVictor Oct 17 '14 at 21:53