57

I have a 10-character string key field in a database. I've used CRC32 to hash this field, but I'm worrying about duplicates. Could somebody show me the probability of collision in this situation?

P.S.: My string field is unique in the database. If the number of string fields is 1 million, what is the probability of a collision?

wovano
  • 4,543
  • 5
  • 22
  • 49
nguyenngoc101
  • 1,211
  • 4
  • 16
  • 28

2 Answers2

117

Duplicate of Expected collisions for perfect 32bit crc

The answer referenced this article: http://arstechnica.com/civis/viewtopic.php?f=20&t=149670

Found the image below from: http://preshing.com/20110504/hash-collision-probabilities

enter image description here

Community
  • 1
  • 1
Adam Morris
  • 8,265
  • 12
  • 45
  • 68
47

In the case you cite, at least one collision is essentially guaranteed. The probability of at least one collision is about 1 - 3x10-51. The average number of collisions you would expect is about 116.

In general, the average number of collisions in k samples, each a random choice among n possible values is:

N(n,k)~=k(k-1)/(2n)

The probability of at least one collision is:

p(n,k)~=1-e^(-k(k-1)/(2n))

In your case, n = 232 and k = 106.

The probability of a three-way collision in your case is about 0.01. See the Birthday Problem.

Mark Adler
  • 101,978
  • 13
  • 118
  • 158