While reading the MSDN documentation for the Object.GetHashCode method, I came across phrases like "the hash function should provide a random or useful distribution in a hash table." What does this distribution mean with regard to a hash function or hash table?
-
http://en.wikipedia.org/wiki/Hash_table – L.B Apr 06 '12 at 06:12
-
Roughly: The hash values should be "spread randomly across their domain without an apparent pattern" (e.g. minimal clumping and maximum spread when viewed visually). Many hash implementations will *rehash* the hash to reduce the chance of clumping "appearing" when put into buckets. – Apr 06 '12 at 06:15
2 Answers
A hash function produces a 32-bit integer for the purpose of "balancing" a hash table. Suppose your table has a hundred "buckets", and you place each item into a bucket based on the bottom two decimal digits of its hash value.
Now suppose the hash function always produces numbers that are multiples of one hundred. Every item will go into the same bucket, because the bottom two digits are always 00, and the hash table will be unbalanced. That would be a bad hash function.
A good hash algorithm produces a roughly even distribution no matter how many buckets you have and no matter how you extract the bucket number from the hash.
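The hundred-bucket scenario above can be sketched quickly (in Python for illustration; the helper name `bucket_for` and the specific "good" hash formula are my own, not from the answer):

```python
def bucket_for(h, buckets):
    # Map a hash value to one of `buckets` slots, as the hundred-bucket
    # table above does with the bottom two decimal digits.
    return h % buckets

buckets = 100

# A bad hash that only ever returns multiples of 100:
bad = {bucket_for(i * 100, buckets) for i in range(1000)}

# A better-spread hash: 31 is coprime to 100, so the residues
# cycle through every bucket before repeating.
good = {bucket_for(i * 31 + 17, buckets) for i in range(1000)}

print(len(bad))   # 1   -- every item lands in the same bucket
print(len(good))  # 100 -- items spread across all buckets
```

A thousand items hashed with the bad function pile into a single bucket, turning every lookup into a linear scan; the spread-out function uses all hundred buckets.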

For hash tables to function with maximum efficiency, hash values should be as unique as possible to prevent collisions. For example, let's consider an extremely naïve hash function: say your objects are first and last names, and for your hash value you choose the initials. So Ginger Rogers' hash value is GR and Fred Astaire's hash value is FA. So far so good, but what happens when Frank Allen comes along with a hash value of FA? Now you have a collision between Fred Astaire and Frank Allen, and the hash table implementation has to handle it as a special case, which reduces efficiency.
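The initials hash described above is tiny enough to write out (a sketch in Python; the function name `initials_hash` is hypothetical):

```python
def initials_hash(name):
    # Naive hash from the example above: concatenate the initials
    # of the first and last name.
    first, last = name.split()
    return first[0] + last[0]

print(initials_hash("Ginger Rogers"))  # GR
print(initials_hash("Fred Astaire"))   # FA
print(initials_hash("Frank Allen"))    # FA -- collides with Fred Astaire
```

With only 26 × 26 possible hash values, collisions like FA/FA become inevitable as soon as the table holds more than a handful of names.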
The best hash functions take the input (Fred Astaire) and produce a value that is (ideally) unique to that input. As long as your hash is smaller than your data, there's no way to completely avoid collisions (the pigeonhole principle), but they can be minimized by carefully choosing the hash algorithm.
As pointed out by Eric below, hash algorithms used to balance hash tables have to be very fast, so you have to strike a balance between speed and collision avoidance. You can study cryptographic hash algorithms like SHA-1 (http://en.wikipedia.org/wiki/SHA-1) to understand the complexities of generating unique hashes, but hash algorithms for balancing hash tables need to be as quick as possible.
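For a sense of what a fast, non-cryptographic hash looks like, here is FNV-1a, a classic table-balancing hash (a Python sketch of the standard 32-bit FNV-1a algorithm, not something from the answer; SHA-1, by contrast, does many rounds of mixing per block and is far slower):

```python
def fnv1a_32(data: bytes) -> int:
    # FNV-1a, 32-bit: XOR each byte into the state, then multiply
    # by the FNV prime, keeping only the low 32 bits.
    h = 2166136261          # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF  # FNV prime
    return h

# Bucket index in a 100-bucket table:
print(fnv1a_32(b"Fred Astaire") % 100)
print(fnv1a_32(b"Frank Allen") % 100)
```

One XOR and one multiply per byte is about as cheap as mixing gets, yet it spreads similar strings (like the two FA names above) across different hash values.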

-
You are doing great right up to your last paragraph. The requirements of cryptographic hash functions and the requirements of hash functions for balancing hash tables are very, very different and you should not conflate the two. You should never use an algorithm like SHA1 for hash table balancing; remember, the point of a hash table balancing algorithm is that *it is a performance optimization*, so don't go using a *slow and complicated* hashing algorithm! – Eric Lippert Apr 06 '12 at 06:40
-
Good point, Eric. I was just trying to point out a hash algorithm that does a very good job at avoiding collisions. I'll update my answer accordingly. – Ethan Brown Apr 06 '12 at 06:42
-
One might choose to hash a 32-bit integer by just returning the 32-bit integer. Great for hash table balancing, awful for cryptographic hashing. I would recommend against studying cryptographic hash algorithms in order to understand hash-table balancing hash functions. – Brian Apr 09 '12 at 15:36