16

The book Numerical Recipes offers a method to calculate 64bit hash codes in order to reduce the number of collisions.

The algorithm is shown at http://www.javamex.com/tutorials/collections/strong_hash_code_implementation_2.shtml and is copied here for reference:

private static final createLookupTable() {
  byteTable = new long[256];
  long h = 0x544B2FBACAAF1684L;
  for (int i = 0; i < 256; i++) {
    for (int j = 0; j < 31; j++) {
      h = (h >>> 7) ^ h;
      h = (h << 11) ^ h;
      h = (h >>> 10) ^ h;
    }
    byteTable[i] = h;
  }
  return byteTable;
}

public static long hash(CharSequence cs) {
  long h = HSTART;
  final long hmult = HMULT;
  final long[] ht = byteTable;
  final int len = cs.length();
  for (int i = 0; i < len; i++) {
    char ch = cs.charAt(i);
    h = (h * hmult) ^ ht[ch & 0xff];
    h = (h * hmult) ^ ht[(ch >>> 8) & 0xff];
  }
  return h;
}

My questions:

1) Is there a formula to estimate the probability of collisions taking into account the so-called Birthday Paradox?

2) Can you estimate the probability of a collision (i.e two keys that hash to the same value)? Let's say with 1,000 keys and with 10,000 keys?

EDIT: rephrased/corrected question 3

3) Is it safe to assume that a collision of a reasonable number of keys (say, less than 10,000 keys) is so improbable so that if 2 hash codes are the same we can say that the keys are the same without any further checking? e.g.

static boolean equals(key1, key2) {

  if (key1.hash64() == key2.hash64())
    return true;  // probability of collision so low we don't need further check

  return false;
}

This is not for security, but execution speed is imperative so avoiding further checks of the keys will save time. If the probability is so low, say less than (1 in 1 billion for 100,000 keys) it will probably be acceptable.

TIA!

isapir
  • 21,295
  • 13
  • 115
  • 116

4 Answers4

47

Is there a formula to estimate the probability of collisions taking into account the so-called Birthday Paradox?

Using the Birthday Paradox formula simply tells you at what point you need to start worrying about a collision happening. This is at around Sqrt[n] where n is the total number of possible hash values. In this case n = 2^64 so the Birthday Paradox formula tells you that as long as the number of keys is significantly less than Sqrt[n] = Sqrt[2^64] = 2^32 or approximately 4 billion, you don't need to worry about collisions. The higher the n, the more accurate this estimation. In fact the probability p(k) that a collision will occur with k keys approaches a step function as n gets larger, where the step occurs at k=Sqrt[n].


Can you estimate the probability of a collision (i.e two keys that hash to the same value)? Let's say with 1,000 keys and with 10,000 keys?

Assuming the hash function is uniformly distributed it's straightforward to derive the formula.

p(no collision for k keys) = 1 * (n-1)/n * (n-2)/n * (n-3)/n * ... * (n-(k-1))/n

That formula directly follows from starting with 1 key: The probability of no collision with 1 key is of course 1. The probability of no collision with 2 keys is 1 * (n-1)/n. And so on for all k keys. Conveniently, Mathematica has a Pochhammer[] function for this purpose to express this succinctly:

p(no collision for k keys) = Pochhammer[n-(k-1),k]/n^k

Then, to calculate the probability that there is at least 1 collision for k keys, subtract it from 1:

p(k) = 1 - p(no collision for k keys) = 1 - Pochhammer[n-(k-1),k]/n^k

Using Mathematica, one can calculate for n=2^64:


Is it safe to assume that a collision of a reasonable number of keys (say, less than 10,000 keys) is so improbable so that if 2 hash codes are the same we can say that the keys are the same without any further checking?

To answer this precisely depends upon the probability that 2 of the 10,000 keys were identical. What we are looking for is:

p(a=b|h(a)=h(b)) = The probability that a=b given h(a)=h(b)

where a and b are keys (possibly identical) and h() is the hashing function. We can apply Bayes' Theorem directly:

p(a=b|h(a)=h(b)) = p(h(a)=h(b)|a=b) * p(a=b) / p(h(a)=h(b))

We immediately see that p(h(a)=h(b)|a=b) = 1 (if a=b then of course h(a)=h(b)) so we get

p(a=b|h(a)=h(b)) = p(a=b) / p(h(a)=h(b))

As you can see this depends upon p(a=b) which is the probability that a and b are actually the same key. This depends upon how the group of 10,000 keys were selected in the first place. The calculations for the previous two questions assume all keys are distinct, so more information on this scenario is needed to fully answer it.

Matt
  • 20,108
  • 1
  • 57
  • 70
  • 1
    Good answer on number 2. On number 1, there is no "step" at any value; rather, what happens at about sqrt(n) is that a collision becomes more likely than not. On number 3, you answered the question he probably meant to ask, but the answer to the question he actually asked - "if 2 hash codes are different we can say that the keys are different without any further checking" is always yes since if the keys were the same they would hash to the same hash. – Warren Dew Feb 26 '14 at 02:47
  • thank you for an excellent answer. this is not a matter of convenience but more of a matter of speed. it seems like collision is unlikely even for 100,000 or 1 million keys (I tried to enter your equation into WolframAlpha but couldn't get it to work). – isapir Feb 26 '14 at 02:48
  • @WarrenDew Regarding the step function: the key word is *approaches*. For any finite `n`, the probability function of a collision `p(k)` is strictly increasing such that `p(1)=0` and `p(k)=1` for `k>n`. As `n` goes to infinity, `p(k)` approaches the [unit step function](http://en.wikipedia.org/wiki/Heaviside_step_function), and the step occurs at `k=Sqrt[n]`. This is simply the mathematical result of applying precalculus limits to `p(k)` (though somewhat tedious). For number 3, you're right; I answered the question the OP *meant* to ask. :) – Matt Feb 26 '14 at 04:10
  • 2
    Your chance of collision is approximately 50% when the number of hash codes generated is `sqrt(n)`. Saying "you don't have to worry about collisions when the number is less than `sqrt(n)`" is stretching things. With a 64-bit hash code, the chance of collision is one in a million when you hash just six million items, and it goes up pretty quickly from there. Considering how often some programs run, "one in a million" isn't very good odds. See http://www.informit.com/guides/content.aspx?g=dotnet&seqNum=792. – Jim Mischel Feb 26 '14 at 15:38
  • @JimMischel Yes if question #1 is all that was posed, I would have spent more time with an example, such as you have given. Since examples **were** given in question #2, this allowed for some brevity in answering question #1 to express the essential spirit of the sqrt[n] approximation. The necessary disclaimers I think are in fact contained within the answer as a whole. In any case the phrasing can always be improved so to your point I changed "at simply `Sqrt[n]`" to "at around `Sqrt[n]`". Thanks. – Matt Feb 26 '14 at 16:13
2

I'll provide a rough approximation to the exact formulas provided in the other answers; the approximation may be able to help you answer #3. The rough approximation is that the probability of a collision occurring with k keys and n possible hash values with a good hashing algorithm is approximately (k^2)/2n, for k << n. For 100,000 keys with a 64 bit hash, that's 10^10 / 32x10^18 or about 1 in 3 billion.

However, I suspect that if you go with not checking the actual key values on collision, there is a larger chance you'll find out the hashing algorithm is not "good" enough, after all.

Warren Dew
  • 8,790
  • 3
  • 30
  • 44
1

Is there a formula to estimate the probability of collisions taking into account the so-called Birthday Paradox?

See: Birthday attack.

Assuming the distribution of hashes is uniform, the probability of a collision for n keys is approximately n2/265.

Is it safe to assume that a collision of a reasonable number of keys (say, less than 10,000 keys) is so improbable so that if 2 hash codes are different we can say that the keys are different without any further checking?

It's only safe when you use a cryptographic hash function. Even if you can tolerate a mistake every 3*1011 times, you may have to consider the possibility that the input is specifically built to create a hash collision, as an attack on your program.

Anton
  • 3,113
  • 14
  • 12
1

1) Is there a formula to estimate the probability of collisions taking into account the so-called Birthday Paradox?

The probability of a single collision occurring depends on the key set generated as the hash function is uniform we can do following to calculate the probability that collision doesnt occurs at generation of k keys as follows :-

x = hash size
p(k=2) = (x-1)/x
p(k=3) = p(k=2)*(x-2)/x
..
p(k=n) = (x-1)*(x-2)..(x-n+1)/x^n

p(k=n) ~ e^-(n*n)/2x

p(collision|k=n) = 1-p(k=n) = 1 - e^(-n^2)/2x
p(collision) > 0.5 if n ~ sqrt(x)

Hence if sqrt(2^64) keys that is 2^32 key are generated there is higher chance that there is a single collision.

2) Can you estimate the probability of a collision (i.e two keys that hash to the same value)? Let's say with 1,000 keys and with 10,000 keys?

x = 2^64 
Use the formula pc(k=n) = 1 - e^-(n^2)/2x

3) Is it safe to assume that a collision of a reasonable number of keys (say, less than 10,000 keys) is so improbable so that if 2 hash codes are the same we can say that the keys are the same without any further checking?

This is a very interesting question because it depends on the size of key space. Suppose your keys are generated at random from space of size = s and hash space is x=2^64 as you mentioned. Probability of collision is Pc(k=n|x) = 1-e^(-n^2)/2x. If Probability of choosing same key in key space is P(k=n|s) = 1-e^(-n^2)/2s . For it to be sure that if hash is same then keys are same:-

P(k=n|s) > Pc(k=n|x)
1-e^-(n^2/2s) > 1-e^-(n^2/2x) 
n^2/2s > n^2/2x 
s < x
s < 2^64

Hence it shows that for keys to be same if hash is same that key set size must be small than 2^64 approx otherwise there is a chance of collision in hash more than in key set. The result is independent of number of keys generated.

Vikram Bhat
  • 6,106
  • 3
  • 20
  • 19