3

I've got millions of data records that are each about 2MB in size. Every one of these pieces of data are stored in a file and there is a set of other data associated with that record (stored in a database).

When my program runs I'll be presented, in memory, with one of the data records and need to produce the associated data. To do this I'm imagining taking an MD5 of the memory, then using this hash as a key into the database. The key will help me locate the other data.

What I need to know is if an MD5 hash of the data contents is a suitable way to uniquliy identify a 2MB piece of data, meaning can I use an MD5 hash without worrying too much about collisions?

I realize there is a chance for collision, my concern is how likely is the chance for collision on millions of 2MB data records? Is collision a likely occurrence? What about when compared to hard disk failure or other computer failures? How much data can MD5 be used to safely identify? what about millions of GB files?

I'm not worried about malice or data tampering. I've got protections such that I wont be receiving manipulated data.

stuck
  • 2,264
  • 2
  • 28
  • 62

1 Answers1

3

This boils down to so-called Birthday paradox. That Wikipedia page has simplified formulas for evaluating the collision probability. It will be very some very small number.

The next question is how you deal with say 10-12 collision probability - see this very similar question.

Community
  • 1
  • 1
sharptooth
  • 167,383
  • 100
  • 513
  • 979
  • Only that the year has 2^128 days. – Gumbo Dec 06 '10 at 08:16
  • Wikipedia says: "10^18 to 10^15 is the uncorrectable bit error rate of a typical hard disk [2]. In theory, MD5, 128 bits, should stay within that range until about 820 billion documents." – Julius Musseau Dec 06 '10 at 08:20