9

I use MD5 hash for identifying files with unknown origin. No attacker here, so I don't care that MD5 has been broken and one can intendedly generate collisions.

My problem is I need to provide logging so that different problems are diagnosed easier. If I log every hash as a hex string that's too long, inconvenient and looks ugly, so I'd like to shorten the hash string.

Now I know that just taking a small part of a GUID is a very bad idea - GUIDs are designed to be unique, but part of them are not.

Is the same true for MD5 - can I take say first 4 bytes of MD5 and assume that I only get collision probability higher due to the reduced number of bytes compared to the original hash?

Community
  • 1
  • 1
sharptooth
  • 167,383
  • 100
  • 513
  • 979
  • possible duplicate of http://stackoverflow.com/questions/860405/are-the-first-32-bits-of-an-md5-hash-just-as-random-as-any-other-substring – Andreas Brinck May 06 '10 at 10:08
  • I'm wondering if getting the first 4 bytes is better than using the CRC32 of the md5 hash. – Nick Dandoulakis May 06 '10 at 10:13
  • Yes, because the first 32 bits in MD5 is supposed to be perfectly randomly distributed, so you can't improve the distribution. – Andreas Brinck May 06 '10 at 10:19
  • @Nick D: Yes, since I already have that MD5 and it is controlling my program flow, while the CRC32 would be completely unrelated to it. – sharptooth May 06 '10 at 13:25
  • sharptooth: I meant statistically better. Andreas is right. CRC32 is pointless since MD5 was designed to has good random distribution across the whole 2^128-bit range. – Nick Dandoulakis May 06 '10 at 15:48

3 Answers3

8

The short answer is yes, you can use the first 4 bytes as an id. Beware of the birthday paradox though:

http://en.wikipedia.org/wiki/Birthday_paradox

The risk of a collision rapidly increases as you add more files. With 50.000 there's roughly 25% chance that you'll get an id collision.

EDIT: Ok, just read the link to your other question and with 100.000 files the chance of collision is roughly 70%.

Andreas Brinck
  • 51,293
  • 14
  • 84
  • 114
1

Here is a related topic you may refer to

What is the probability that the first 4 bytes of MD5 hash computed from file contents will collide?

Community
  • 1
  • 1
ZelluX
  • 69,107
  • 19
  • 71
  • 104
1

Another way to shorten the hash is to convert it to something more efficient than HEX like Base64 or some variant there-of.

Even if you're determined to take on 4 characters, taking 4 characters of base64 gives you more bits than hex.

shoosh
  • 76,898
  • 55
  • 205
  • 325
  • only if you turn the hex into it's 0-F form. You can still take the raw byte values and use that instead. Which in that case would be more bits than base-64 – Sekhat Oct 28 '10 at 09:01