I want to use files' md5 as key to store mp3 files, but I'm afraid that different files have same md5. So I want to know what's 128bit md5 's clash rate on different count of files, such as 10 million or 50 million? Is there any tool or formula by which I can calculate the rate directly?
-
@mvp I've seen that problem before, they are taking about the theoretic clash rate, meaning when items' count is up to 2^64, its clash rate is 100%. But I want to know the exact clash rate on defined size's set. For example, how many items would make the rate up to 50%. – magicyang May 20 '13 at 09:52
-
*All* known collisions so far were carefully crafted. To the best of my knowledge, not a single *accidental* md5 collision has ever happened – mvp May 20 '13 at 09:54
-
@mvp Do you mean that I can use md5 as key to store 50 million audio files and don't need to worry about anything? – magicyang May 20 '13 at 09:58
-
I strongly believe so. – mvp May 20 '13 at 09:59
-
50 million? Yes no problem. 50 million is not even close to the number you'll need to have any chance of a collision. – Mike Vine May 20 '13 at 10:03
1 Answers
Assuming MD5 is perfect.
For 50million files, there are 50000000 x 49999999 / 2 possible clashes [each file against each other file].
This is 2499999950000000.
There are 2^128 = 3.4028236692093846346337460743177e+38 possible md5 hashes.
Therefore 50milllion files has 2499999950000000 / 2 * 3.4028236692093846346337460743177e+38 = 3.6e-24 chance of clashing [approximately].
This means there is 0.000000000000000000000036% chance of a clash.
Which is pretty much the same as 'never'
Now MD5 is known to NOT be perfect, and it is attackable. However, for normal files (i.e. ones not specifically generated to be a problem) it can be considered so.
So if the users control the uploaded files and can attack the system then you should use SHA2 or better, if there's no issue here md5 is perfectly fine.

- 9,468
- 25
- 44
-
I believe you are off by a factor two. The number of pairs is only half of what you are calculating, the order doesn't matter. See e.g. Wikipedia Birthday paradox. – user515430 May 21 '13 at 05:25
-