1

I have 232 4-byte keys that I'm hashing; what's the chance of collision?

What if I have 264 8-byte keys (not really storing every key, but I want to know the worst case)?

  • 2
    Is this homework? Whether it is or not, what have you figured out so far? – Felix Kling Feb 04 '11 at 15:00
  • What are you using to store 2**64 keys? Is it affordable? I'd like to be able to store that much data… – Donal Fellows Feb 04 '11 at 15:02
  • `@Felix Kling:` This isn't homework; I've worked out the general formula (probability that at least two keys are the same) for *n* random keys drawn from a discrete uniform distribution with range [ 0 , *d* ]. It works for small numbers, but *n* = 2^32 and *d* = 2^128 is too big. – Supercollider Feb 04 '11 at 15:05
  • Duplicate of http://stackoverflow.com/questions/1867191/probability-of-sha1-collisions (just change the variables to match MD5's hash space) – bdonlan Feb 04 '11 at 15:06
  • `@Donal Fellows:` I'm not storing *every* key from 0 to 2^64; I'm just trying to figure out the worst case. – Supercollider Feb 04 '11 at 15:06
  • `@bdonlan:` I know the general formula ( *n* = 2^32, *d* = 2^128 ). These are very specific values for *n* (4-byte or 8-byte keys), but very large, so I can't plug them into the formula. – Supercollider Feb 04 '11 at 15:08
  • If the formula is n*(n-1)/(2**(b+1)) then with large n, take n==n-1 so you get 2**64/2**129 == 2**-65 – bart Feb 04 '11 at 15:18
  • @Super: There's worst case and there's being ridiculous. – Donal Fellows Feb 04 '11 at 15:26

2 Answers2

3

Per the wikipedia page on the Birthday Problem, a good first order approximation can be found with 1-e^(-(n^2)/d). Graphing this for your values gives this graph (logarithmic horizontal axis, I've zoomed in on where the probability starts to spike). Note that this is only an approximation, and should be considered conservatively (ie, the real probability may be somewhat higher, but it should be in the right ballpark).

Nightfirecat
  • 11,432
  • 6
  • 35
  • 51
bdonlan
  • 224,562
  • 31
  • 268
  • 324
0

What are you doing with the hash codes? If you're using them to work out whether two pieces of data are the same, an MD5 hash is pretty good, though only if you are working with data that is not being created by malicious entities. (Cryptographic purposes need better hash algorithms precisely in order to deal with the "malicious attacker" problem.)

If you're using them for building a map (i.e., you're building a hash table) it's usually better to use a cheap hash and come up with a way to mitigate the costs of collision (e.g., by hanging a linked list off the hash table and resizing/rebuilding when the average weight gets too large).

Donal Fellows
  • 133,037
  • 18
  • 149
  • 215