
I have around 5,000,000 objects stored in a Dictionary<MyKey, MyValue>.

MyKey is a struct that packs each component of my key (5 different numbers) in the right-most 44 bits of an Int64 (ulong).
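Roughly, the struct looks like this (the field names and bit widths here are just an illustration, not my exact layout):

public struct MyKey : IEquatable<MyKey>
{
    // All five components packed into the low 44 bits of one ulong.
    private readonly ulong _packed;

    public MyKey(uint a, uint b, uint c, uint d, uint e)
    {
        // Illustrative widths: 12 + 10 + 10 + 8 + 4 = 44 bits.
        _packed = ((ulong)a << 32) | ((ulong)b << 22) | ((ulong)c << 12) | ((ulong)d << 4) | e;
    }

    public bool Equals(MyKey other) { return _packed == other._packed; }
    public override bool Equals(object obj) { return obj is MyKey && Equals((MyKey)obj); }

    public override int GetHashCode()
    {
        // Currently just delegates to the packed value's hash code.
        return _packed.GetHashCode();
    }
}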

Since the ulong will always start with 20 zero-bits, my gut feeling is that returning the native Int64.GetHashCode() implementation is likely to collide more often than if the hash code implementation only considered the 44 bits that are actually in use (although mathematically, I wouldn't know where to begin to prove that theory).

More collisions would mean more calls to .Equals() and slower dictionary lookups.

The .NET implementation of Int64.GetHashCode() looks like this:

public override int GetHashCode()
{
    return (int)this ^ (int)(this >> 32);
}
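
For one of my keys, where only the low 44 bits are in use, that amounts to XORing the full low 32 bits with just the 12 bits left over after the shift, e.g. (made-up value):

ulong key = 0x0ABCDEF01234;              // a 44-bit value; the top 20 bits are zero
int hash = (int)key ^ (int)(key >> 32);  // 0xDEF01234 ^ 0x00000ABC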

How would I best implement GetHashCode()?

  • Is slow dictionary lookup actually an observed problem (bottleneck) in your system? If not, I wouldn't worry about it. – Matt Burland Sep 25 '14 at 18:13
  • I don't think there's anything wrong with the native implementation. It's XORing the two 32-bit chunks of the 64-bit number together. 32 bits gives four billion possibilities, yet you say you're storing only 5 million objects. – David Crowell Sep 25 '14 at 18:14
  • The low 32 bits are all used, so the fact that the high 20 bits aren't used doesn't seem to matter as much once the xor is done. – recursive Sep 25 '14 at 18:17
  • And I think your premise is wrong. If you are only using `44` bits, you could (by analogy with the `Int64.GetHashCode` implementation) do something like `(int)this ^ (int)(this >> 22)`, but that actually gives you *fewer* values and, therefore, *more* collisions. – Matt Burland Sep 25 '14 at 18:22
  • Despite what the others are saying, my gut feeling matches yours. How about taking your 44 bits and place them in two uints, one with the top 32 of the 44 bits and the other with the bottom 32 of the 44 bits. (Yes, 12 of your 44 bits will be present in both ints, but that is less of a de-randomizing factor than the 20 zero bits, says my gut feeling.) So then take hash codes of the two ints, and then XOR them together. – RenniePet Sep 25 '14 at 21:31
  • Two corrections to previous comment: The overlap is 20 bits, not 12. And my expectation of fewer collisions by using a "smarter" hashing seems to be wrong - see the answer I've posted. – RenniePet Sep 26 '14 at 00:19

2 Answers


I couldn't begin to suggest a "best" way to hash 44-bit numbers, but I can suggest a way to compare a 44-bit hash against the standard 64-bit hash algorithm.

One way to do this is simply to check how many collisions you get for a set of numbers (as suggested by McKenzie et al. in Selecting a Hashing Algorithm). Unless you're going to test all possible values in your set, you'll need to judge whether the number of collisions you get is acceptable. This could be done in code with something like:

var rand = new Random(42);
var dict64 = new Dictionary<int, int>();
var dict44 = new Dictionary<int, int>();
for (int i = 0; i < 100000; ++i)
{
    // get a value between 0 and 0xFFFFFFFFFFF (max 44-bit value)
    var value44 = (ulong)(rand.NextDouble() * 0x0FFFFFFFFFFF);
    // get a value between 0 and ulong.MaxValue
    var value64 = (ulong)(rand.NextDouble() * ulong.MaxValue);
    var hash64 = value64.GetHashCode();
    var hash44 = (int)value44 ^ (int)(value44 >> 32);
    if (!dict64.ContainsKey(hash64))
    {
        dict64.Add(hash64, hash64);
    }
    if (!dict44.ContainsKey(hash44))
    {
        dict44.Add(hash44, hash44);
    }
}
Trace.WriteLine(string.Format("64-bit hash: {0}, 64-bit hash with 44-bit numbers: {1}", dict64.Count, dict44.Count));

In other words, this reproducibly generates (thanks to the constant seed) 100,000 random 64-bit values and 100,000 random 44-bit values, hashes each one, and keeps track of how many unique hash values result.

In my test this produced 99,998 unique hash values for the 44-bit numbers and 99,997 unique hash values for the 64-bit numbers. So that's one fewer collision for 44-bit numbers than for 64-bit numbers. I would expect fewer collisions with 44-bit numbers simply because there are fewer possible inputs.

I'm not going to tell you the 64-bit hash method is "best" for 44-bit numbers; you'll have to decide whether these results mean it's good enough for your circumstances.

Ideally you should be testing with realistic values that your application is likely to generate. Since those will all be 44-bit values, there's nothing meaningful to compare against the collisions ulong.GetHashCode() produces on them (i.e. you'd be hashing the same values the same way and get identical results). If random values based on a constant seed aren't good enough, modify the code to generate something closer to your real keys.
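
For example, if your keys are really five bounded components packed together rather than uniformly random 44-bit numbers, you could generate the test values the same way. A hypothetical helper (the component widths are just an assumption; adjust them to your layout):

// Hypothetical: pack five random components into 44 bits (12 + 10 + 10 + 8 + 4)
// the same way MyKey does; adjust the widths to your real fields.
static ulong NextPackedKey(Random rand)
{
    ulong a = (ulong)rand.Next(1 << 12);
    ulong b = (ulong)rand.Next(1 << 10);
    ulong c = (ulong)rand.Next(1 << 10);
    ulong d = (ulong)rand.Next(1 << 8);
    ulong e = (ulong)rand.Next(1 << 4);
    return (a << 32) | (b << 22) | (c << 12) | (d << 4) | e;
}

// ...then in the loop above:
// var value44 = NextPackedKey(rand);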

While things might not "feel" right, science suggests there's no point in changing something without reproducible tests that prove a change is necessary.

Peter Ritchie

Here's my attempt to answer this question, which I'm posting despite the fact that the answer is the opposite of what I was expecting. (Although I may have made a mistake somewhere - I almost hope so, and am open to criticism regarding my test technique.)

  // Number of Dictionary hash buckets found here:
  // http://stackoverflow.com/questions/24366444/how-many-hash-buckets-does-a-net-dictionary-use
  const int CNumberHashBuckets = 4999559;   // the simulation was also run with 5999471

  static void Main(string[] args)
  {
     Random randomNumberGenerator = new Random();

     int[] dictionaryBuckets1 = new int[CNumberHashBuckets];
     int[] dictionaryBuckets2 = new int[CNumberHashBuckets];

     for (int i = 0; i < 5000000; i++)
     {
        // random 44-bit key
        ulong randomKey = (ulong)(randomNumberGenerator.NextDouble() * 0x0FFFFFFFFFFF);

        // the plain ulong hash code
        int simpleHash = randomKey.GetHashCode();
        BumpHashBucket(dictionaryBuckets1, simpleHash);

        // top 32 of the 44 bits XORed with the bottom 32 bits
        // (int.GetHashCode() simply returns the int itself)
        int superHash = ((int)(randomKey >> 12)).GetHashCode() ^ ((int)randomKey).GetHashCode();
        BumpHashBucket(dictionaryBuckets2, superHash);
     }

     int collisions1 = ComputeCollisions(dictionaryBuckets1);
     int collisions2 = ComputeCollisions(dictionaryBuckets2);

     Console.WriteLine("Simple hash collisions: {0}, \"super hash\" collisions: {1}",
                       collisions1, collisions2);
  }

  private static void BumpHashBucket(int[] dictionaryBuckets, int hashedKey)
  {
     // map the hash code to a bucket index the way (I assume) Dictionary does
     int bucketIndex = (int)((uint)hashedKey % CNumberHashBuckets);
     dictionaryBuckets[bucketIndex]++;
  }

  private static int ComputeCollisions(int[] dictionaryBuckets)
  {
     // every entry beyond the first in a bucket counts as one collision
     int i = 0;
     foreach (int dictionaryBucket in dictionaryBuckets)
        i += Math.Max(dictionaryBucket - 1, 0);
     return i;
  }

I try to simulate the processing that Dictionary does. The OP says he has "around 5,000,000" objects in a Dictionary, and according to the referenced source there will be either 4999559 or 5999471 "buckets" in the Dictionary.

Then I generate 5,000,000 random 44-bit keys to simulate the OP's Dictionary entries, and hash each key two different ways: the simple ulong.GetHashCode() and the alternative way I suggested in a comment. Then I turn each hash code into a bucket index using modulo - I assume that's roughly how Dictionary does it. This is used to increment the pseudo-buckets as a way of counting collisions.
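
For reference, the .NET Framework Dictionary maps a hash code to a bucket essentially like this (simplified from the reference source); the main difference from my simulation is that it clears the sign bit rather than casting to uint:

// Simplified from the .NET Framework Dictionary<TKey, TValue> reference source:
int hashCode = comparer.GetHashCode(key) & 0x7FFFFFFF;  // clear the sign bit
int targetBucket = hashCode % buckets.Length;           // buckets.Length is a prime such as 4999559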

Unfortunately (for me) the results are not what I was hoping for. With 4999559 buckets the simulation typically indicates around 1.8 million collisions, with my "super hash" technique actually producing slightly (around 0.01%) MORE collisions. With 5999471 buckets there are typically around 1.6 million collisions, and my so-called super hash gives maybe 0.1% fewer collisions.

So my "gut feeling" was wrong, and there seems to be no justification for trying to find a better hash code technique.

RenniePet