2

I have read several answers here on SO and on the web about choosing a good hash table length and that it should be a prime to reduce collisions and to uniformly distribute the keys across the hash table.

Though there are a lot of answers, I couldn't find a satisfying proof, and I didn't understand the explanations I found.

So if we have a key k and a hash table of length n and we do k % n = i to find the index i of a bucket in a hash table, we say that n should be a prime in order to minimize the number of collisions and to better distribute the keys across the hash table.

But why? Here is my attempt to prove it. It is going to be quite long and a bit pedantic, but please bear with me and try to read till the end.

I will start by making the following assumptions:

  • For every key k of the set of keys K, we can have a k which is either even or odd. A key is an integer, either even (k = 2x) or odd (k = 2x + 1).
  • For every n we may choose, n also can either be even (n = 2y) or odd (n = 2y + 1).
  • If we add an even number to another even number, we get an even number (2x + 2y = 2(x + y)). Likewise, if we add an odd number to another odd number, we still get an even number ((2x + 1) + (2y + 1) = 2x + 1 + 2y + 1 = 2x + 2y + 2 = 2(x + y + 1)).
  • If we add an odd number to an even number (same as adding an even number to an odd one), we always get an odd number ((2x + 1) + 2y = 2x + 1 + 2y = 2(x + y) + 1).

First of all, let's try to think about using an n which is not a prime, so maybe we will find out that these numbers are not good enough to be used as the length of a hash table (assuming that the keys share some patterns, e.g. like being all even or all odd).

  1. Let's assume that n is even, i.e. n = 2y. In this case we have 2 scenarios: our keys k of K can be even (1.1.) or odd (1.2.).

1.1. n = 2y is even, keys are even k = 2x

For k = 2x and n = 2y, we have: k % n = 2x % 2y = i.

In this case we can say that if both the key k and hash table length n are even, then i is also going to always be even. Why? Because if we take the quotient by the integer division k // n = 2x // 2y = q, we get a quotient q such that:

k = 2x = (n * q) + i = (2y * q) + i = 2yq + i

Since 2yq (2y * q) is an even number, in order to satisfy 2x = 2yq + i the remainder i is always going to be even because 2x is even (even + even = even). If i were odd, we would get an odd number (even + odd = odd), but again 2x is even.

This leads to the following issue if we choose n to be even: if all our ks are even, then they will always end up in a bucket at an even index, increasing the number of collisions and clustering because only half n / 2 of the length of our hash table (only the even indices) will be occupied.

Therefore it's not a good idea to use an even number for n if all our ks or the majority of our ks are going to be even.

1.2. n = 2y is even, keys are odd k = 2x + 1

For k = 2x + 1 and n = 2y, we have: k % n = (2x + 1) % 2y = i. Likewise, in this case if all of our ks (or the majority of them) are going to be odd, we end up in this situation:

k = 2x + 1 = (n * q) + i = (2y * q) + i = 2yq + i

Since 2yq is even, in order to get an odd k = 2x + 1, i is always going to be odd (even + odd = odd).

Again, choosing an even n as the hash table length is a bad idea even if all or the majority of our ks are odd, because we will end up with only the odd indices (buckets) being occupied.

So let's try with an n which is not an even number, i.e. an odd n = 2y + 1.

  1. Let's assume that n is odd, i.e. n = 2y + 1. We still have even (2.1.) and odd (2.2.) keys (k of K).

2.1. n = 2y + 1 is odd, keys are even k = 2x

Here we have:

k = 2x = (n * q) + i = ((2y + 1) * q) + i = (2yq + q) + i = 2yq + q + i

We know that 2yq is even, so in order to get k = 2x which is even as well we need q + i to also be even. When can q + i be even? Only in these 2 cases:

  1. q -> even, i -> even, even + even = even
  2. q -> odd, i -> odd, odd + odd = even

If either q or i is even while the other one is odd, we will get an odd q + i, and consequently an odd 2yq + (q + i), but we have k = 2x which is even, so either both q and i are even or they're both odd.

In this case we can see that for an odd n = 2y + 1, i can either be even or odd, which is good because it means that now we will use both even and odd bucket indices of our hash table and not only the even or only the odd ones.

By the way, it turns out that all primes p : p > 2 are odd numbers, so at least for now we can say that choosing a prime could be a good idea because a prime greater than 2 is always odd.

2.2. n = 2y + 1 is odd, keys are odd k = 2x + 1

Similarly here:

k = 2x + 1 = (n * q) + i = ((2y + 1) * q) + i = 2yq + q + i = 2yq + (q + i)

In order to get an odd k = 2x + 1 we need (q + i) to be odd (2yq is even), and this happens only in these 2 cases:

  1. q -> even, i -> odd, even + odd = odd
  2. q -> odd, i -> even, odd + even = odd

Again, we prove that an odd number is a better choice for n as this way we have the chance that both even and odd bucket's indices i are going to be occupied.

Now, I got stuck here. Is there a connection between this proof and prime numbers and how can I continue this proof to conclude that a prime number p would be an even better choice than a generic odd number with a similar reasoning?

EDIT:

So I tried to reason about it a bit further. This is what I came up with:

3. Using a generic odd n sharing a common factor f with k

We can say that for any factor f which is shared across k (k = f * x = fx) and n (n = f * y = fy), we end up with an i = k % n also sharing that common factor f. Why?

Again, if we try to compute k:

k = fx = (n * q) + i = (fy * q) + i = fyq + i

Then:

k = fx = fyq + i

Can only be satisfied if and only if i also shares f as one of its factors, e.g. i = f * g = fg:

k = fx = fyq + fg = f(yq + g)

Leading to yq + g = x.

This means that if both k and n share a common factor, then the result of the modulo i will also have that common factor and therefore i will always be a multiple of that common factor, e.g. for k of K = {12, 15, 33, 96, 165, 336} and n = 9 (an odd number, not a prime):

k    |  k % n
---------------------------
12   |  12 % 9 = 3
15   |  15 % 9 = 6
33   |  33 % 9 = 6
96   |  96 % 9 = 6
165  | 165 % 9 = 3
336  | 336 % 9 = 3

Both k and n always share a common factor (3 in this case). This leads to i = k % n also being a multiple of 3 and therefore, again in such scenarios the hash table's bucket indices being used will only be those that are multiples of the common factor 3.

So while an odd number for n is definitely better than an even one (as explained at 2.1. and 2.2), we still may have unwanted patterns in numbers when k and n both share a common factor f.

So, if we make n a prime (n = p), we will certainly avoid that n shares that common factor f with k (provided that f != p), because a prime p can only have two factors: 1 and itself. So...

4. Using a prime for n

If n is a prime (n = p), we end up with:

k = fx = (q * p) + i = qp + i

Then:

k = fx = qp + i

Implies that the quotient q resulting from the integer division k // n can either share the common factor f or not, i.e.:

  1. q = fz

Or:

  1. q = z

In the first case (q = fz) we have:

k = fx = (q * p) + i = (fz * p) + i = fzp + i

So i ends up sharing the common factor f as well, e.g. i = fg:

k = fx = (q * p) + i = (fz * p) + i = fzp + i = fzp + fg = f(zp + g)

Such that zp + g = x.

And in the second case (q = z), we have:

k = fx = (q * p) + i = (z * p) + i = zp + i = zp + i

i.e. in this second case, i won't have f as one of its factors, as zp doesn't have f among its factors too.

So when using a prime for n, the benefit is that the result for i = k % n can either share a common factor f with k or not share it at all, e.g. for k of K = {56, 64, 72, 80, 88, 96} and n = p = 17:

k    |  k % n
---------------------------
56   |  56 % 17 = 5
64   |  64 % 17 = 13
72   |  72 % 17 = 4 ---> Common factor f = 4 of k and i 
80   |  80 % 17 = 12 ---> Common factor f = 4 of k and i
88   |  88 % 17 = 3
96   |  96 % 17 = 11

In this case, all ks share a common factor f = 4, but only i = 72 % 17 = 4 and i = 80 % 17 = 12 both have k and i sharing that common factor f:

72 % 17 = 4 -> (18 * 4) % 17 = (4 * 1)
80 % 17 = 12 -> (20 * 4) % 17 = (4 * 3)

Also, if we take the previous example, for k of K = {12, 15, 33, 96, 165, 336} and we use the prime 17 for n instead of 9, we get:

k    |  k % n
---------------------------
12   |  12 % 17 = 12
15   |  15 % 17 = 15
33   |  33 % 17 = 16
96   |  96 % 17 = 11
165  | 165 % 17 = 12
336  | 336 % 17 = 13

Even here, we see that the common factor f = 3 in this case is shared between both k and n only in these 3 cases:

12 % 17 = 12 -> (4 * 3) % 17 = (4 * 3)
15 % 17 = 15 -> (5 * 3) % 17 = (5 * 3)
165 % 17 = 12 -> (55 * 3) % 17 = (4 * 3)

This way, using a prime, the probability for a collision has decreased, and we can distribute the data across the hash table better.

Now, what happens if even k is a prime, or at least a multiple of a prime? I think that in this case the distribution along the hash table would be even better, because there won't be any common factors between k and n if they are both primes or if k is a multiple of a prime, provided that k is not a multiple of the prime n.

This is my conclusion why a prime is better suited for the length of a hash table.

Would appreciate to receive your feedback and thoughts regarding my way of understanding this topic.

Thank you.

tonix
  • 6,671
  • 13
  • 75
  • 136
  • I’m voting to close this question because it's not about programming, it seems better suited for [Cryptography Stack Exchange](https://crypto.stackexchange.com/) however. – Alejandro Jun 17 '21 at 19:45
  • 5
    @Alejandro I wouldn't say that this question is related to cryptography at all. Have you read it all? I don't mention anything about cryptography. It's only about the idea behind using prime numbers for a hash table data structure and the proof that lies behind this reasoning. – tonix Jun 17 '21 at 19:47
  • Prime numbers and hash tables are a crypto concept :D. Even then, neither is a programming question (there isn't even code here). Mathematics might be a better match too. – Alejandro Jun 17 '21 at 20:34
  • You're just math-ing yourself in circles trying to prove something that isn't true. It's all about the distribution of the keys `k`. If the keys are approximately uniformly distributed integers then it makes no difference if n is prime or not: k % n will also be approximately uniformly distributed. In practice keys are not uniformly distributed, so n being a prime *may* result in better hash table utilization. – President James K. Polk Jun 17 '21 at 22:25
  • The use of a prime is generally in the formula used to generate k, not for the size of the table. – rossum Jun 18 '21 at 07:55
  • 1
    @Alejandro My question doesn't have anything to do with cryptography, it can be related, but it's about data structures (an hash table is a data structure). Check out these questions: https://stackoverflow.com/questions/18037909/using-a-prime-number-for-the-size-of-a-hashtable, https://stackoverflow.com/questions/730620/how-does-a-hash-table-work, https://stackoverflow.com/questions/7306316/b-tree-vs-hash-table . They don't have code. I think you agree with me that programming isn't about code, it's about thinking in the first place. Code is the concrete result of our thoughts. – tonix Jun 18 '21 at 09:21
  • @PresidentJamesK.Polk `If the keys are approximately uniformly distributed integers then it makes no difference if n is prime or not` I know that. `In practice keys are not uniformly distributed` I know that as well, that's why I am asking this question. `so n being a prime may result in better hash table utilization` Why? Can you prove it? Thank you. – tonix Jun 18 '21 at 09:25
  • 1
    @rossum I know that. For example, Java uses the prime `31` in it's `hashCode` implementations e.g. for `String`. That's not the point, though. Many books, articles, academic papers about hash tables advise to use a prime for the size of the table as well, otherwise I wouldn't have asked this question. – tonix Jun 18 '21 at 09:29
  • The 31 multiplier stems from K&R (33 can also be used) and evolved via Sun, IIRC. It is especially attractive because it can be implemented as shift+add, and because [0-9], [A-Z], and [a-z] are guaranteed to hash to different values, at least for ASCII. And I agree: using primes is often cargo-cult-programming. BTW: another factor is (avoiding) collision attacks. – wildplasser Jun 19 '21 at 12:05
  • Related question: https://stackoverflow.com/questions/1145217/why-should-hash-functions-use-a-prime-number-modulus – Tony Delroy Jun 19 '21 at 15:43
  • @wildplasser Do you think my proof makes sense? – tonix Jun 19 '21 at 19:51
  • 1
    In theory: yes. In practice: it does not matter. Normally the cardinality of the keyspace is larger than the range of the hash-space, so you'll have to deal with collisions anyway. Having a good *spread* (and no *obvious* collisions) is the first objective. – wildplasser Jun 19 '21 at 20:37
  • OK, thank you. But a prime seems to spread the keys better, doesn't it? – tonix Jun 19 '21 at 21:27
  • A prime *for what* ?. You have -more or less- proved that it doesnt matter fo r`n'. The *spread* is goverrned by the actual hash function, not by n. – wildplasser Jun 19 '21 at 23:59
  • 1
    But if you use a prime for `n` and your keys conform to some kind of pattern (e.g. all being multiples of a certain factor `f`), the chance of spreading them across different buckets increases. On the other hand, if `n` is not a prime, the chance of using the same buckets which are multiples of a common factor `f` increases. – tonix Jun 20 '21 at 13:15
  • The hash function plays a key role too. A good hash function would mitigate the issues of an `n` not being a prime. This is how I understood the whole concept and it makes sense to me now. – tonix Jun 20 '21 at 13:16

1 Answers1

0

When it comes to chaining hash tables, you pretty much have the answer, although it can be written in fewer words:

  • Data often has patterns. Memory addresses, for example, often have zero in the lower bits.
  • Many hash functions, especially the very popular polynomial hash functions, are built using only addition, subtraction, and multiplication. All of these operations have the property that the lowest n bits of the result depend only on the lowest n bits of the operands, so these hash functions have this property too.
  • If your table size is zero in the lowest n bits, and the data is all the same in the lowest n bits, and your hash function has the property mentioned above... then your hash table will only use one out of every 2n of its slots.

Sometimes we fix this problem by choosing an odd hash table size. Prime sizes are better, because each small factor of the table size causes similar problems with different arithmetic progressions in hash values.

Sometimes, though, we fix this problem by adding an additional hash step to the hash table itself -- an additional hash step that mixes all the bits of the hash together and prevents this kinds of problems. Java HashMap uses tables with size 2N, but does this extra mixing step to help cover all the slots.

That's for chaining hash tables.

For hash tables that use open addressing to resolve collisions, the choice of a prime table size is usually required to ensure that the probing scheme will eventually check all (or at least half of) the slots. This is required to guarantee that the table will work at least until it is (half) full.

Matt Timmermans
  • 53,709
  • 3
  • 46
  • 87
  • Thank you for your answer. Please, check my edit, I have added some additional thoughts and I think I came up with a proof based on my initial reasoning. Thank you! – tonix Jun 19 '21 at 11:37