If our set of keys are integers then this means that the prime number
needs to be of the next bigger data type e.g. long.
That is not a problem. Sometimes it is necessary otherwise the hash family cannot be universal. See below for more information.
But eventually whatever we get as the hash would need to be
down-casted to an int
to index the hash table.
Doesn't this down-casting affect the quality of the Universal Hashing
(I am referring to the distribution of the keys over the buckets)
somehow?
The answer is no. I will try to explain.
Whether p
has another data type or not is not important for the hash family to be universal. Important is that p
is equal or larger than u
(the maximum integer of the universe of integers). It is important that p
is big enough (i.e. >= u
).
A hash family is universal when the collision probability is equal or
smaller than 1/m
.
So the idea is to hold that constraint.
The value of p
, in theory, can be as big as a long
or more. It just needs to be an integer and prime.
u
is the size of the domain/universe (or the number of keys). Given the universe U = {0, ..., u-1}
, u
denotes the size |U|
.
m
is the number of bins or buckets
p
is a prime which must be equal or greater than n
- the hash family is defined as
H = {h(a,b)(x)}
with h(a,b)(x) = ((a * x + b) mod p) mod m
. Note that a
and b
are randomly chosen integers (from all possible integers, so theoretically can be larger than p
) modulo a prime p
(which can make them either smaller or larger than m
, the number of bins/buckets); but here too the data type (domain of values does not matter). See Hashing integers on Wikipedia for notation.
- Follow the proof on Wikipedia and you conclude that the collision probability is
_p/m_ * 1/(p-1)
(the underscores mean to truncate the decimals). For p >> m
(p
considerably bigger than m
) the probability tends to 1/m
(but this does not mean that the probability would be better the larger p
is).
In other terms answering your question: p
being a bigger data type is not a problem here and can be even required. p
has to be equal or greater than u
and a
and b
have to be randomly chosen integers modulo p
, no matter the number of buckets m
. With these constraints you can construct a universal hash family.
Maybe a mathematical example could help
- Let U be the universe of integers that correspond to
unsigned char
(in C for example). Then U = {0, ..., 255}
Let p
be (next possible) prime equal or greater than 256
. Note that p
can be any of these types (short
, int
, long
be it signed or unsigned). The point is that the data type does not play a role (In programming the type mainly denotes a domain of values.). Whether 257
is short
, int
or long
doesn't really matter here for the sake of correctness of the mathematical proof. Also we could have chosen a larger p
(i.e. a bigger data type); this does not change the proof's correctness.
- The next possible prime number would be
257
.
- We say we have
25
buckets, i.e. m = 25
. This means a hash family would be universal if the collision probability is equal or less than 1/25
, i.e. approximately 0.04
.
- Put in the values for
_p/m_ * 1/(p-1)
: _257/25_ * 1/256 = 10/256 = 0.0390625
which is smaller than 0.04
. It is a universal hash family with the chosen parameters.
We could have chosen m = u = 256
buckets. Then we would have a collision probability of 0.003891050584
, which is smaller than 1/256 = 0,00390625
. Hash family is still universal.
Let's try with m
being bigger than p
, e.g. m = 300
. Collision probability is 0, which is smaller than 1/300 ~= 0.003333333333
. Trivial, we had more buckets than keys. Still universal, no collisions.
Implementation detail example
We have the following: