1

I am using crc32 function of PHP to generate numeric equivalent for MongoId since i am using this numeric id in mysql search since string search is slow.

I came across a case where crc32 gives same numeric value for two different mongoid's.

Any help or suggestion would be greatly appreciated.

Thanks Gaurav

coder
  • 283
  • 4
  • 26
  • this might be relevant http://stackoverflow.com/questions/14210298/probability-of-collision-when-using-a-32-bit-hash – Alex Andrei Dec 02 '15 at 05:25

2 Answers2

2

Unless your strings are four bytes or less, then it is inevitable that many strings will have any given CRC-32 value. If you have even just one more than 2^32 possible strings, then it is absolutely guaranteed that at least two of those strings will map to the same CRC-32.

There is no help or suggestion. You cannot expect there to be no collisions, unless there are fewer possible strings than possible CRCs.

By the way, you can construct such cases intentionally with my spoof code, which lets you give it a set of bits you would allow to be changed in a string, and it will tell you which of those bits to flip in order to get a desired CRC.

Mark Adler
  • 101,978
  • 13
  • 118
  • 158
2

@MarkAdler's answer explains why you are getting a hash collision. But if I were in your shoes, I'd be more interested in what I could do about it.

What you could do, of course, is use a different hashing algorithm that produces longer hashes (less chance of collision), but is still acceptably fast. You'll find a highly-rated review of several alternatives in this question from programmers.stackexchange.com. They all have collisions (coincidentally CRC32 did pretty well in that answer's test sets), but you could try some of them on mongoids and see what happens.

I also found this clever suggestion: To generate a 64-bit hash, you could take two different 32-bit hash algorithms and concatenate the hashes (of course this will more or less halve the speed of your hashing).

The more robust solution would be to write your code with the understanding that a hash is a bucket, and you will sometimes get multiple results (or the wrong result) from a crc32 query. Simply add a second step to check the unhashed Id(s) of the returned records. Since there's only ever going to be a handful of hits, it won't take long at all.

Community
  • 1
  • 1
alexis
  • 48,685
  • 16
  • 101
  • 161