0

So I need to create a dictionary with keys that are objects with a custom Equals() function. I discovered I need to override GetHashCode() too. I heard that for optimal performance you should have hash codes that don't collide, but that seems counter intuitive. I might be misunderstanding it, but it seems the entire point of using hash codes is to group items into buckets and if the hash codes never collide each bucket will only have 1 item which seems to defeat the purpose.

So should I intentionally make my hash codes collide occasionally? Performance is important. This will be a dictionary that will probably grow to multiple million items and I'll be doing lookups very often.

ForeverNoobie
  • 531
  • 5
  • 17

3 Answers3

2

The goal of a hash code is to give you an index into an array, each of which is a bucket that may contain zero, one, or more items. The performance of the lookup then is dependent on the number of elements in the bucket. The fewer the better, since once you're in the bucket, it's an O(n) search (where n is the number of elements in the bucket). Therefore, it's ideal if the hashcode prevents collisions as much as possible, allowing for the optimal O(1) time as much as possible.

Kirk Woll
  • 76,112
  • 22
  • 180
  • 195
1

Dictionaries store data in buckets but there isn't one bucket for each hashcode. The number of buckets is based on the capacity. Values are put into buckets based on the modulus of the hashcode and number of buckets.

Lets say you have a GetHashCode() method that produces these hash codes for five objects:

925
10641
14316
17213
28624

Hash codes should be spread out. So these look spread out, right? If we have 7 buckets, then we end up calculating the modulus of each which gives us:

1
1
1
0
1

So we end up with buckets:

0 - 1 item
1 - 4 items
2 - 0 items
3 - 0 items
4 - 0 items
5 - 0 items
6 - 0 items

oops, not so well spread out now.

This is not made up data. These are actual hash codes.

Here's a sample of how to generate a hash code from contained data (not the formula used for the above hash codes, a better one).

https://stackoverflow.com/a/263416/118703

Community
  • 1
  • 1
Samuel Neff
  • 73,278
  • 17
  • 138
  • 182
0

You must ensure that the following holds:

(GetHashCode(a) != GetHashCode(b)) => !Equals(a, b)

The reverse implication is identical in meaning:

Equals(a, b) => (GetHashCode(a) == GetHashCode(b))

Apart from that, generate as few collisions as possible. A collision is defined as:

(GetHashCode(a) == GetHashCode(b)) && !Equals(a, b)

A collision does not affect correctness, but performance. GetHashCode always returning zero would be correct for example, but slow.

usr
  • 168,620
  • 35
  • 240
  • 369