
Possible Duplicate:
Why should hash functions use a prime number modulus?

Why is it necessary for a hash table's (the data structure's) size to be a prime number?

From what I understand, it ensures a more even distribution, but is there any other reason?

  • This is a duplicate of [Why should hash functions use a prime number modulus?](http://stackoverflow.com/questions/1145217/why-should-hash-functions-use-a-prime-number-modulus) - the first link in the "Related" section of the sidebar - and I think the [accepted answer](http://stackoverflow.com/questions/1145217/why-should-hash-functions-use-a-prime-number-modulus/1147232#1147232) is very good. – Matthew Slattery Oct 20 '10 at 22:34
  • You should accept an answer. – jds Oct 17 '17 at 01:18
  • I just noticed this was marked as a duplicate. That's unfortunate. These are two related but separate questions. This particular question is about the use of a prime for a hash table's capacity. The other is about the use of a prime in calculating an appropriate hash value. They're related to each other, but not duplicates. – Samuel Neff Aug 30 '19 at 17:13

1 Answer


The only reason is to avoid clustering values into a small number of buckets (yes, distribution). A more evenly distributed hash table performs more consistently.

From http://srinvis.blogspot.com/2006/07/hash-table-lengths-and-prime-numbers.html:

Suppose your hashCode function results in the following hashCodes, among others: {x, 2x, 3x, 4x, 5x, 6x, ...}. Then all of these are going to be clustered in just m buckets, where m = table_length / GreatestCommonFactor(table_length, x). (It is trivial to verify/derive this; a short demonstration follows the list below.) Now you can do one of the following to avoid clustering:

  1. Make sure that you don't generate too many hashCodes that are multiples of another hashCode, as in {x, 2x, 3x, 4x, 5x, 6x, ...}. But this may be difficult if your hashTable is supposed to have millions of entries.

  2. Or simply make m equal to table_length by making GreatestCommonFactor(table_length, x) equal to 1, i.e. by making table_length coprime with x. And if x can be just about any number, then make sure that table_length is a prime number.
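To see where the formula for m comes from: the multiples of x, reduced mod table_length, can only land on multiples of g = GreatestCommonFactor(table_length, x), and there are exactly table_length / g such residues. Here is a minimal Java sketch (the method name and the sample numbers are mine, not from the quoted post) that counts how many buckets the multiples of x actually occupy:

```java
public class ClusterDemo {
    // Count how many distinct buckets the hash codes x, 2x, 3x, ...
    // occupy in a table of the given length, with index = hash % length.
    static int occupiedBuckets(int tableLength, int x) {
        boolean[] used = new boolean[tableLength];
        int count = 0;
        for (int i = 1; i <= 10 * tableLength; i++) {
            int bucket = (i * x) % tableLength; // hashes assumed non-negative
            if (!used[bucket]) {
                used[bucket] = true;
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int x = 12;
        // Composite length sharing a factor with x: only a quarter of the
        // buckets are ever used, since 100 / gcd(100, 12) = 25.
        System.out.println(occupiedBuckets(100, x)); // prints 25
        // Prime length: gcd(101, 12) = 1, so every bucket gets used.
        System.out.println(occupiedBuckets(101, x)); // prints 101
    }
}
```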

Update: (from original answer author)

This answer is correct for common hash table implementations, including Java's original Hashtable as well as the current implementation of .NET's Dictionary.

Both the answer and the supposition that the capacity should be prime are not accurate for Java's HashMap, though. HashMap's implementation is very different: it uses a power-of-two table size to store the buckets and computes the bucket index as (n - 1) & hash rather than the more traditional hash % n formula.

Java's HashMap forces the actual capacity used to be the next power of two at or above the requested capacity.
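For illustration, that round-up can be done with the classic bit-smearing trick; this sketch mirrors the spirit of the private tableSizeFor helper in the JDK 8 source linked below, but is simplified (the real code also clamps the result to the valid capacity range):

```java
// Round cap up to the nearest power of two at or above it.
static int nextPowerOfTwo(int cap) {
    int n = cap - 1;   // so an exact power of two maps to itself
    n |= n >>> 1;      // smear the highest set bit into every
    n |= n >>> 2;      // lower position...
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;     // ...so n now has the form 0b0111...1
    return n + 1;      // e.g. 1000 -> 1024, 1024 -> 1024
}
```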

Compare Hashtable:

int index = (hash & 0x7FFFFFFF) % tab.length;

https://github.com/openjdk/jdk/blob/jdk8-b120/jdk/src/share/classes/java/util/Hashtable.java#L364

To HashMap:

first = tab[(n - 1) & hash]

https://github.com/openjdk/jdk/blob/jdk8-b120/jdk/src/share/classes/java/util/HashMap.java#L569
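The mask and the modulus agree exactly when n is a power of two and the hash is non-negative, which is why HashMap can use the cheaper bitwise form. A quick check (my own test code, not from the JDK):

```java
public class MaskVsMod {
    public static void main(String[] args) {
        int n = 16; // a power of two, as HashMap guarantees
        for (int hash = 0; hash < 1_000_000; hash++) {
            // n - 1 is 0b1111; the mask keeps exactly the low bits,
            // which is the remainder of division by a power of two.
            if (((n - 1) & hash) != hash % n) {
                throw new AssertionError("mismatch at hash=" + hash);
            }
        }
        System.out.println("mask and modulus agree for non-negative hashes");
    }
}
```

Note that (n - 1) & hash stays non-negative even for negative hash values, whereas hash % n would not; that is one reason Hashtable needs the & 0x7FFFFFFF mask before taking the modulus.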

  • Then I guess my understanding was right: avoid clustering <=> get a better distribution. Right? Thanks for the reference. – Olivier Lalonde Oct 20 '10 at 17:07
  • @Olivier Lalonde, if this answered your question please mark it as the answer. – Samuel Neff Jan 29 '11 at 20:34
  • `then all these are going to be clustered in just m number of buckets, where m = table_length/GreatestCommonFactor(table_length, x). (It is trivial to verify/derive this)` How to verify/derive this? – tonix Jun 17 '21 at 09:40
  • @tonix that's a quote, but I assume you can simulate adding a ton of items to a hash table simply by creating an array of integers representing buckets, looping through a large number of items, and adding each to the appropriate array element. A well-distributed hash table, as described above, will have similar counts in each array element. A poorly distributed one, such as one using the non-prime length from the first example, will show spikes at some indexes. – Samuel Neff Jun 17 '21 at 13:31
  • @SamuelNeff Thank you for your reply Samuel, but I was looking for a way to prove it. – tonix Jun 17 '21 at 16:15