I have read several answers here on SO and on the web about choosing a good hash table length and that it should be a prime to reduce collisions and to uniformly distribute the keys across the hash table.
Though there are a lot of answers, I couldn't find a satisfying proof, and I didn't understand the explanations I found.
So if we have a key k
and a hash table of length n
and we do k % n = i
to find the index i
of a bucket in a hash table, we say that n
should be a prime in order to minimize the number of collisions and to better distribute the keys across the hash table.
But why? Here is my attempt to prove it. It is going to be quite long and a bit pedantic, but please bear with me and try to read till the end.
I will start by making the following assumptions:
- For every key
k
of the set of keysK
, we can have ak
which is either even or odd. A key is an integer, either even (k = 2x
) or odd (k = 2x + 1
). - For every
n
we may choose,n
also can either be even (n = 2y
) or odd (n = 2y + 1
). - If we add an even number to another even number, we get an even number (
2x + 2y = 2(x + y)
). Likewise, if we add an odd number to another odd number, we still get an even number ((2x + 1) + (2y + 1) = 2x + 1 + 2y + 1 = 2x + 2y + 2 = 2(x + y + 1)
). - If we add an odd number to an even number (same as adding an even number to an odd one), we always get an odd number (
(2x + 1) + 2y = 2x + 1 + 2y = 2(x + y) + 1
).
First of all, let's try to think about using an n
which is not a prime, so maybe we will find out that these numbers are not good enough to be used as the length of a hash table (assuming that the keys share some patterns, e.g. like being all even or all odd).
- Let's assume that
n
is even, i.e.n = 2y
. In this case we have 2 scenarios: our keysk
ofK
can be even (1.1.) or odd (1.2.).
1.1. n = 2y
is even, keys are even k = 2x
For k = 2x
and n = 2y
, we have: k % n = 2x % 2y = i
.
In this case we can say that if both the key k
and hash table length n
are even,
then i
is also going to always be even.
Why? Because if we take the quotient by the integer division k // n = 2x // 2y = q
, we get a quotient q
such that:
k = 2x = (n * q) + i = (2y * q) + i = 2yq + i
Since 2yq
(2y * q
) is an even number, in order to satisfy 2x = 2yq + i
the remainder i
is always going to be even because 2x
is even (even + even = even
). If i
were odd, we would get an odd number (even + odd = odd
), but again 2x
is even.
This leads to the following issue if we choose n
to be even: if all our k
s are even, then they will always end up in a bucket at an even index, increasing the number of collisions and clustering because only half n / 2
of the length of our hash table (only the even indices) will be occupied.
Therefore it's not a good idea to use an even number for n
if all our k
s or the majority of our k
s are going to be even.
1.2. n = 2y
is even, keys are odd k = 2x + 1
For k = 2x + 1
and n = 2y
, we have: k % n = (2x + 1) % 2y = i
.
Likewise, in this case if all of our k
s (or the majority of them) are going to be odd, we end up in this situation:
k = 2x + 1 = (n * q) + i = (2y * q) + i = 2yq + i
Since 2yq
is even, in order to get an odd k = 2x + 1
, i
is always going to be odd (even + odd = odd
).
Again, choosing an even n
as the hash table length is a bad idea even if all or the majority of our k
s are odd, because we will end up with only the odd indices (buckets) being occupied.
So let's try with an n
which is not an even number, i.e. an odd n = 2y + 1
.
- Let's assume that
n
is odd, i.e.n = 2y + 1
. We still have even (2.1.) and odd (2.2.) keys (k
ofK
).
2.1. n = 2y + 1
is odd, keys are even k = 2x
Here we have:
k = 2x = (n * q) + i = ((2y + 1) * q) + i = (2yq + q) + i = 2yq + q + i
We know that 2yq
is even, so in order to get k = 2x
which is even as well we need q + i
to also be even.
When can q + i
be even? Only in these 2 cases:
q -> even
,i -> even
,even + even = even
q -> odd
,i -> odd
,odd + odd = even
If either q
or i
is even while the other one is odd, we will get an odd q + i
, and consequently an odd 2yq + (q + i)
, but we have k = 2x
which is even, so either both q
and i
are even or they're both odd.
In this case we can see that for an odd n = 2y + 1
, i
can either be even or odd, which is good because it means that now we will use both even and odd bucket indices of our hash table and not only the even or only the odd ones.
By the way, it turns out that all primes p : p > 2
are odd numbers, so at least for now we can say that choosing a prime could be a good idea because a prime greater than 2 is always odd.
2.2. n = 2y + 1
is odd, keys are odd k = 2x + 1
Similarly here:
k = 2x + 1 = (n * q) + i = ((2y + 1) * q) + i = 2yq + q + i = 2yq + (q + i)
In order to get an odd k = 2x + 1
we need (q + i)
to be odd (2yq
is even), and this happens only in these 2 cases:
q -> even
,i -> odd
,even + odd = odd
q -> odd
,i -> even
,odd + even = odd
Again, we prove that an odd number is a better choice for n
as this way we have the chance that both even and odd bucket's indices i
are going to be occupied.
Now, I got stuck here. Is there a connection between this proof and prime numbers and how can I continue this proof to conclude that a prime number p
would be an even better choice than a generic odd number with a similar reasoning?
EDIT:
So I tried to reason about it a bit further. This is what I came up with:
3. Using a generic odd n
sharing a common factor f
with k
We can say that for any factor f
which is shared across k
(k = f * x = fx
) and n
(n = f * y = fy
), we end up with an i = k % n
also sharing that common factor f
. Why?
Again, if we try to compute k
:
k = fx = (n * q) + i = (fy * q) + i = fyq + i
Then:
k = fx = fyq + i
Can only be satisfied if and only if i
also shares f
as one of its factors, e.g. i = f * g = fg
:
k = fx = fyq + fg = f(yq + g)
Leading to yq + g = x
.
This means that if both k
and n
share a common factor, then the result of the modulo i
will also have that common factor and therefore i
will always be a multiple of that common factor, e.g. for k
of K = {12, 15, 33, 96, 165, 336}
and n = 9
(an odd number, not a prime):
k | k % n
---------------------------
12 | 12 % 9 = 3
15 | 15 % 9 = 6
33 | 33 % 9 = 6
96 | 96 % 9 = 6
165 | 165 % 9 = 3
336 | 336 % 9 = 3
Both k
and n
always share a common factor (3
in this case).
This leads to i = k % n
also being a multiple of 3
and therefore, again in such scenarios the hash table's bucket indices being used will only be those that are multiples of the common factor 3
.
So while an odd number for n
is definitely better than an even one (as explained at 2.1. and 2.2), we still may have unwanted patterns in numbers when k
and n
both share a common factor f
.
So, if we make n
a prime (n = p
), we will certainly avoid that n
shares that common factor f
with k
(provided that f != p
), because a prime p
can only have two factors: 1 and itself. So...
4. Using a prime for n
If n
is a prime (n = p
), we end up with:
k = fx = (q * p) + i = qp + i
Then:
k = fx = qp + i
Implies that the quotient q
resulting from the integer division k // n
can either share the common factor f
or not, i.e.:
q = fz
Or:
q = z
In the first case (q = fz
) we have:
k = fx = (q * p) + i = (fz * p) + i = fzp + i
So i
ends up sharing the common factor f
as well, e.g. i = fg
:
k = fx = (q * p) + i = (fz * p) + i = fzp + i = fzp + fg = f(zp + g)
Such that zp + g = x
.
And in the second case (q = z
), we have:
k = fx = (q * p) + i = (z * p) + i = zp + i = zp + i
i.e. in this second case, i
won't have f
as one of its factors, as zp
doesn't have f
among its factors too.
So when using a prime for n
, the benefit is that the result for i = k % n
can either share a common factor f
with k
or not share it at all, e.g. for k
of K = {56, 64, 72, 80, 88, 96}
and n = p = 17
:
k | k % n
---------------------------
56 | 56 % 17 = 5
64 | 64 % 17 = 13
72 | 72 % 17 = 4 ---> Common factor f = 4 of k and i
80 | 80 % 17 = 12 ---> Common factor f = 4 of k and i
88 | 88 % 17 = 3
96 | 96 % 17 = 11
In this case, all k
s share a common factor f = 4
, but only i = 72 % 17 = 4
and i = 80 % 17 = 12
both have k
and i
sharing that common factor f
:
72 % 17 = 4 -> (18 * 4) % 17 = (4 * 1)
80 % 17 = 12 -> (20 * 4) % 17 = (4 * 3)
Also, if we take the previous example, for k
of K = {12, 15, 33, 96, 165, 336}
and we use the prime 17
for n
instead of 9
, we get:
k | k % n
---------------------------
12 | 12 % 17 = 12
15 | 15 % 17 = 15
33 | 33 % 17 = 16
96 | 96 % 17 = 11
165 | 165 % 17 = 12
336 | 336 % 17 = 13
Even here, we see that the common factor f = 3
in this case is shared between both k
and n
only in these 3 cases:
12 % 17 = 12 -> (4 * 3) % 17 = (4 * 3)
15 % 17 = 15 -> (5 * 3) % 17 = (5 * 3)
165 % 17 = 12 -> (55 * 3) % 17 = (4 * 3)
This way, using a prime, the probability for a collision has decreased, and we can distribute the data across the hash table better.
Now, what happens if even k
is a prime, or at least a multiple of a prime? I think that in this case the distribution along the hash table would be even better, because there won't be any common factors between k
and n
if they are both primes or if k
is a multiple of a prime, provided that k
is not a multiple of the prime n
.
This is my conclusion why a prime is better suited for the length of a hash table.
Would appreciate to receive your feedback and thoughts regarding my way of understanding this topic.
Thank you.