2

This is my first question on stackflow. As you can tell, I am new to learning algorithms and data structure.

When using the division method for create a hash function (i.e. h(k) = k mod m), one is advised (e.g. by CLRS) to use a prime number not too close to a power of 2 for the divisor m. Could someone kindly explain to me why a choice of m to be a composite number is bad?

user1234310
  • 321
  • 1
  • 3
  • 9
  • possible duplicate of [Why should hash functions use a prime number modulus?](http://stackoverflow.com/questions/1145217/why-should-hash-functions-use-a-prime-number-modulus) – interjay Feb 26 '12 at 20:30

2 Answers2

14

Consider the case if m is even and all the k values are all even. Then, the hash values will also all be even.

For example, consider the case m=6 hashing even values:

Input values:  0, 2, 4, 6, 8, 10, 12, 14, 16, ...
Hash values:   0, 2, 4, 0, 2,  4,  0,  2,  4, ...

If you use these hash values as indices into a table, then half of the table will be unused. On the other hand, if m is a prime, you will get an even distribution of the hash values, even if the input values all have a common factor.

Consider the same input values, but with m=7:

Input values:  0, 2, 4, 6, 8, 10, 12, 14, 16, ...
Hash values:   0, 2, 4, 6, 1,  3,  5,  0,  2, ...

Despite the fact that the input values are all even, the hash values are still uniformly distributed over [0..6].

So to summarize, if m is prime, then you'll still get an even distribution of hash values even if all input values are divisible a common prime factor (other than m).

Igor ostrovsky
  • 7,282
  • 2
  • 29
  • 28
  • What you call input values are in fact the result of a bad hash function, with insufficient entropy in the lower bits. The "problem" could also be cured by choosing a better hash function. – wildplasser Feb 26 '12 at 22:20
  • In the context of OP's question, those **are** the **input values**. The OP writes, "When using the division method for create a hash function (i.e. h(k) = k mod m), [...]". When I say **input values**, I am referring to the values of *k* that you feed into *h(k)*, calculating *k mod m*. – Igor ostrovsky Feb 26 '12 at 22:31
  • Though you are formally right, the problem remains: the **input values** are not evenly distributed. (in the example: they are all even). If they are allowed to be unevenly distributed they just as well could all be a multiple of the table size. Another way to tackle this kind of problem is to multiply them all by an odd (relatively prime wrt m) number before performing the modulus. That will allow the table size to be *any* number, including powers of two. – wildplasser Feb 26 '12 at 22:56
  • Your suggestion doesn't work. If you have even integers (0, 2, 4, 6, ...) and multiply them by 3, they will still be even integers: (0, 6, 12, 18). And yes, in the real world the input values are not necessarily uniformly distributed. But understanding what happens when the values **are** uniformly distributed is the first step to understanding the more general case. So, that's what I tried focusing on in my answer. – Igor ostrovsky Feb 26 '12 at 23:16
  • Oops, you are right there of course. Should have been some real hash function applied to the input values instead. Or relying on the final modulus prime... – wildplasser Feb 26 '12 at 23:58
2

If you know the hash function, then you can always generate the perfect set of input data which will make the hash function behave miserably. There is no "universal best" hash function. But there is always a best set of input data and a worst set of input data.

A good hash function is expected to map any subset of a large set X into a smaller output set Y, with a minimal and fair distribution of collisions across the set Y.

As a corollary, there is no way to predict a hash function will be good without any knowledge of the data set for which the function will qualify as a "good one".

An advice about using a prime number rather than a composite number, without knowledge of the values to hash, is not better than claiming 87654321 is the best modulus for hashing.

If you want to hash prime numbers, or Fibonacci numbers, then advices about using a prime modulus, or composite modulus, or a power of 2, are irrelevant.

If you want to hash composite numbers pairwise co-prime, then a composite modulus co-prime to each element of the input set is a candidate for a "good" choice. A prime modulus larger than the largest factor of all the elements of the input set is another "good" choice.

If your input set is a sparse set of integers within an interval, with a Gaussian distribution of the intervals between the numbers, then the "best" choice of a modulus is a number which minimize the occurrences of gcd(modulus, input data) != 1. This is more likely to occur with the choice of a prime number as modulus.

This is the reason to suggest to use prime numbers.

Pierre
  • 21
  • 1