10

I am reading about Universal hashing on integers. The prerequisite and mandatory precondition seems to be that we choose a prime number p greater than the set of all possible keys.

I am not clear on this point.

If our set of keys are of type int then this means that the prime number needs to be of the next bigger data type e.g. long.

But eventually whatever we get as the hash would need to be down-casted to an int to index the hash table. Doesn't this down-casting affect the quality of the Universal Hashing (I am referring to the distribution of the keys over the buckets) somehow?

Ely
  • 10,860
  • 4
  • 43
  • 64
Cratylus
  • 52,998
  • 69
  • 209
  • 339

1 Answers1

6

If our set of keys are integers then this means that the prime number needs to be of the next bigger data type e.g. long.

That is not a problem. Sometimes it is necessary otherwise the hash family cannot be universal. See below for more information.

But eventually whatever we get as the hash would need to be down-casted to an int to index the hash table.

Doesn't this down-casting affect the quality of the Universal Hashing (I am referring to the distribution of the keys over the buckets) somehow?

The answer is no. I will try to explain.

Whether p has another data type or not is not important for the hash family to be universal. Important is that p is equal or larger than u (the maximum integer of the universe of integers). It is important that p is big enough (i.e. >= u).

A hash family is universal when the collision probability is equal or smaller than 1/m.

So the idea is to hold that constraint.

The value of p, in theory, can be as big as a long or more. It just needs to be an integer and prime.

  • u is the size of the domain/universe (or the number of keys). Given the universe U = {0, ..., u-1}, u denotes the size |U|.
  • m is the number of bins or buckets
  • p is a prime which must be equal or greater than n
  • the hash family is defined as H = {h(a,b)(x)} with h(a,b)(x) = ((a * x + b) mod p) mod m. Note that a and b are randomly chosen integers (from all possible integers, so theoretically can be larger than p) modulo a prime p (which can make them either smaller or larger than m, the number of bins/buckets); but here too the data type (domain of values does not matter). See Hashing integers on Wikipedia for notation.
  • Follow the proof on Wikipedia and you conclude that the collision probability is _p/m_ * 1/(p-1) (the underscores mean to truncate the decimals). For p >> m (p considerably bigger than m) the probability tends to 1/m (but this does not mean that the probability would be better the larger p is).

In other terms answering your question: p being a bigger data type is not a problem here and can be even required. p has to be equal or greater than u and a and b have to be randomly chosen integers modulo p, no matter the number of buckets m. With these constraints you can construct a universal hash family.


Maybe a mathematical example could help

  • Let U be the universe of integers that correspond to unsigned char (in C for example). Then U = {0, ..., 255}
  • Let p be (next possible) prime equal or greater than 256. Note that p can be any of these types (short, int, long be it signed or unsigned). The point is that the data type does not play a role (In programming the type mainly denotes a domain of values.). Whether 257 is short, int or long doesn't really matter here for the sake of correctness of the mathematical proof. Also we could have chosen a larger p (i.e. a bigger data type); this does not change the proof's correctness.

    1. The next possible prime number would be 257.
    2. We say we have 25 buckets, i.e. m = 25. This means a hash family would be universal if the collision probability is equal or less than 1/25, i.e. approximately 0.04.
    3. Put in the values for _p/m_ * 1/(p-1): _257/25_ * 1/256 = 10/256 = 0.0390625 which is smaller than 0.04. It is a universal hash family with the chosen parameters.

We could have chosen m = u = 256 buckets. Then we would have a collision probability of 0.003891050584, which is smaller than 1/256 = 0,00390625. Hash family is still universal.

Let's try with m being bigger than p, e.g. m = 300. Collision probability is 0, which is smaller than 1/300 ~= 0.003333333333. Trivial, we had more buckets than keys. Still universal, no collisions.


Implementation detail example

We have the following:

  • x of type int (an element of |U|)
  • a, b, p of type long
  • m we'll see later in the example

    1. Choose p so that it is bigger than the max u (element of |U|), p is of type long.
    2. Choose a and b (modulo p) randomly. They are of type long, but always < p.
    3. For an x (of type int from U) calculate ((a*x+b) mod p). a*x is of type long, (a*x+b) is also of type long and so ((a*x+b) mod p is also of type long. Note that ((a*x+b) mod p)'s result is < p. Let's denote that result h_a_b(x).
    4. h_a_b(x) is now taken modulo m, which means that at this step it depends on the data type of m whether there will be downcasting or not. However, it does not really matter. h_a_b(x) is < m, because we take it modulo m. Hence the value of h_a_b(x) modulo m fits into m's data type. In case it has to be downcasted there won't be a loss of value. And so you have mapped a key to a bin/bucket.
Ely
  • 10,860
  • 4
  • 43
  • 64
  • 1
    I think the issue is not how quick the downcast is, but how it affects the distribution of the hashtable and therefore its overall performance. – RealSkeptic Jul 08 '15 at 20:34
  • Not sure if I understand. I don't see how it affects the distribution. – Ely Jul 08 '15 at 20:38
  • @RealSkeptic I updated the answer. Though I believe the question should be migrated to another network, maybe Theoretical CS or Mathematics. It does not really have much to do with programming. – Ely Jul 09 '15 at 20:58
  • @Elyasin:What you have provided is a theoretical approach of the Universal Hashing. But my question is about the implementation details. You note: `that a and b are randomly chosen integers (from all possible integers, so theoretically can be larger than p)`. So choosing a random `a` and `b` would also be a value stored in a `long` data type e.g using `random.nextLong()`. So after we apply the type `(a*k+b)mod p` this is a value of type `long`. Which means that it can be *any* number in the range that fits a `long` which is bigger than the universe of keys we started with. – Cratylus Jul 09 '15 at 21:25
  • @Elyasin: The last step is to apply the last modulo operation on the bucket size i.e. `((a*k+b)mod p) mod m)`. m is a value within the range of integer but is promoted to `long` and the whole result is of type `long`. This result if I understand correctly is expected to have a certain "distribution" of its bit pattern so that it can be inserted anywhere in the hash table so as to avoid colissions. But the next step is to downcast the long value to int. So does this downcast (which essentially removes part of the bit pattern) affect the colission rate or not? – Cratylus Jul 09 '15 at 21:25
  • @Elyasin:Keep in mind that in hash tables of power 2 it is the bit pattern in the lower bits that leads to colissions and the approach to solve it is via avalanche. – Cratylus Jul 09 '15 at 21:25
  • `a` and `b` can have a larger domain (`a != 0`) but they are chosen randomly `modulo p`. This accords with the theoretical foundation. You can choose `a` or `b` even to be of type `long long int` and you can choose them to be larger than `p`. However they have to be calculated `modulo p` before you run the LCG step, i.e. `a*x+b`. Does that lift the confusion? I might have not chosen a good wording... but that is basically how it should work. – Ely Jul 09 '15 at 22:32
  • @Elyasin:`You can choose a or b even to be of type long long int and you can choose them to be larger than p`. No as a and b *must* be less <= `p-1` so they can be `long`. So basically `(a*k+b)` is a value that fits in a `long` (and not an int) and `(a*k+b)mod p` is also a value that *may* fit in a `long`. So the last operation is `((a*k+b)mod p) mod m)` this determines the distribution of the key in the buckets. So I am uncertain/confused about this chaining to `m` as the only requirement on `m` is that `m < p`. – Cratylus Jul 10 '15 at 18:53
  • Don't forget that p must be equal or greater than u (m does not play a role here). If m is bigger than p then we have more buckets than keys because u <= p and thus no collisions; it is in my answer, the last example). The type merely describes the domain; it does not mean that you must use the whole range. So it can be of type long long int (although it is a waste of space of course), but they have to be < p. (a*k+b) is of an appropriate type too (be it long or long long int), but again it will be < p (because we will take it modulo p). – Ely Jul 10 '15 at 19:19
  • So (a*k+b) mod p is of whatever long or long long int, but the value will always be < p. That is perfectly fine. So until now there is no downcasting (in fact downcasting would destroy the distribution property). Then a modulo m means that this result fits into m's data type (domain of values). In case a downcast is necessary there is no loss of value. It is then a mapping from a key (int) to a bin/bucket (whatever data type that is). – Ely Jul 10 '15 at 19:19
  • I added a section "Implementation details example". I hope it helps. – Ely Jul 11 '15 at 08:47
  • @Elyasin:I have one more question. When you say in (2) `Choose a and b (modulo p) randomly` I am not clear on this. Is this different than picking *any* random number less than p? My understanding is that *any* random number less than `p` is acceptable and eventually we would apply the modulo on `p` – Cratylus Jul 11 '15 at 12:52
  • It is not different. You can do both. Choosing integer `a` and `b` randomly in the range `[0, p)` (`a != 0` because it is one of the constraints) is probably more efficient than choosing `a` and `b` from the whole `long` range and then take it `modulo p`. Once you have chosen `a` and `b` from the range `[0,p)` there is no need to do the modulo operation in fact (`a != 0` constraint remains of course). – Ely Jul 11 '15 at 13:05