9

I have a client who is distributing large binary files internally. They are also passing md5 checksums of the files and apparently verifying the files against the checksum before use as part of their workflow.

However they claim that "often" they are encountering corruption in the files where the md5 is still saying that the file is good.

Everything I've read suggests that this should be hugely unlikely.

Does this sound likely? Would another hashing algorithm provide better results? Should I actually be looking at process problems such as them claiming to check the checksum, but not really doing it?

NB, I don't yet know what "often" means in this context. They are processing hundreds of files a day. I don't know if this is a daily, monthly or yearly occurrence.

Gareth Simpson
  • 36,943
  • 12
  • 47
  • 50
  • 6
    Have them provide an example of a "corrupt" file and the "good" original. – Anon. Feb 07 '11 at 23:06
  • 2
    Is it possible that the md5 sum was computed on a corrupt file, or that the corruption is occurring after the sum was computed? To know for sure, take Anon's suggestion and get an example of two files with the same checksum. – BMitch Feb 07 '11 at 23:09
  • So since then, have you looked at the BitTorrent Sync idea? getsync.com – dlamblin Mar 10 '14 at 13:58

6 Answers

14

MD5 is a 128-bit cryptographic hash function, so different messages should be distributed pretty well over the 128-bit space. That means two files (excluding files specifically built to defeat MD5) should have a 1 in 2^128 chance of colliding. In other words, if a pair of files had been compared every nanosecond since the universe began, you still wouldn't expect it to have happened yet.
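
To put rough numbers on that "every nanosecond" remark, here's a quick back-of-envelope check (my own arithmetic, assuming perfectly uniform 128-bit hashes):

    pairs_per_second = 10**9           # one pair of files compared every nanosecond
    collision_odds = 2**128            # chance of a collision per random pair

    seconds = collision_odds / pairs_per_second
    years = seconds / (60 * 60 * 24 * 365)

    print(f"expected wait for one collision: {years:.2e} years")  # ~1.1e22 years
    print("age of the universe:             ~1.4e10 years")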

recursive
  • 83,943
  • 34
  • 151
  • 241
  • 2
    Well, you are aware that it did happen already, aren't you? Of course those collisions were provoked (someone was deliberately trying to make two different files that have the same MD5 checksum), yet this doesn't change the fact that there are several files known to mankind (and these are also out in the wild) that produce exactly the same MD5 checksum, even though they contain totally different data. – Mecki Mar 27 '13 at 00:39
  • 2
    @Mecki: Did you read the part where I said "(excluding files specifically built to defeat MD5)"? – recursive Mar 27 '13 at 06:56
  • But you never have just two files; you have a set of files, and you don't want any two of them to hash to the same value. The birthday bound is sqrt(2^128), which is 2^64. You could store about 4.3 billion files if you used a 64-bit hash, or 280 trillion files with MD5's space. When you've reached 2^128 files you're guaranteed that your next file will collide, if you've managed to avoid collisions to that point; which, practically, you can't have. – dlamblin Mar 07 '14 at 15:59
  • 3
    That's true because of the "birthday paradox". Nonetheless, no one has enough documents to make it plausible that they'd accidentally have a collision. No one has 100 trillion documents. – recursive Mar 11 '14 at 23:31
  • I think he is not talking about collisions, but about something that looks even more unlikely: a corruption in the file content and, at the same time, a corruption in the hash itself such that the two still match. – Jorge González Lorenzo Oct 16 '19 at 12:22
7

If a file is corrupted, then the probability that the corrupted file has the same MD5 checksum as the uncorrupted file is 1 in 2^128. In other words, it will happen almost as "often" as never. It is astronomically more likely that your client is misreporting what really happened (e.g. they are computing the wrong hash).

Nordic Mainframe
  • 28,058
  • 10
  • 66
  • 83
6

Sounds like a bug in their use of MD5 (maybe they are MD5-ing the wrong files), or a bug in the library that they're using. For example, an older MD5 program that I used once didn't handle files over 2GB.
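
If the tool itself is suspect, it's worth re-hashing one of the "corrupt" files with a known-good implementation. A minimal sketch using Python's standard-library hashlib (the chunked reads mean file size doesn't matter, so a 2GB-style limit can't bite; the filename is a placeholder):

    import hashlib

    def md5_of_file(path, chunk_size=1 << 20):
        """Stream the file through MD5 in 1 MiB chunks so arbitrarily large files are fine."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    print(md5_of_file("big_binary.bin"))  # placeholder path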

This question suggests that, on average, you'd get a collision every 100 years if you were generating 6 billion files per second, so it's quite unlikely.

Seth
  • 45,033
  • 10
  • 85
  • 120
4

Does this sound likely?

No, the chance of a random corruption causing the same checksum is 1 in 2^128, i.e. 1 in 3.40 × 10^38. This number puts a 1 in a billion (10^9) chance to shame.

Would another hashing algorithm provide better results?

Probably not. While MD5's collision resistance has been broken against deliberate attack, it's fine against random corruption and remains a popular standard to use.
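
(If they did want to switch anyway, the change is trivial in most libraries; a hypothetical Python example using the standard-library hashlib, with a placeholder filename:)

    import hashlib

    h = hashlib.sha256()                       # only this line differs from hashlib.md5()
    with open("big_binary.bin", "rb") as f:    # placeholder filename
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    print(h.hexdigest())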

Should I actually be looking at process problems such as them claiming to check the checksum, but not really doing it?

Probably, but consider all the possible points of failure:

  1. File corrupted before MD5 generation
  2. File corrupted after MD5 verification.
  3. MD5 program or supporting framework has a bug
  4. Operator misuse (unintentional, e.g. running MD5 program on wrong file)
  5. Operator abuse (intentional, e.g. skipping the verification step)

If it is the last, then one final thought is to distribute the files in a wrapper format that forces the operator to unwrap the file, where the unwrapping does verification during extraction. I'm thinking of something like gzip or 7-Zip, which support large files and can possibly turn compression off (I don't know offhand whether they do).
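
A rough sketch of that wrapper idea in Python, assuming gzip is acceptable: the gzip format stores a CRC32 of the uncompressed data and re-checks it during extraction, and compresslevel=0 effectively turns compression off. The CRC32 is weaker than MD5, but the point is that the check cannot be skipped. File names are placeholders.

    import gzip
    import shutil

    # Wrap: gzip records a CRC32 of the uncompressed data in its trailer;
    # compresslevel=0 keeps this close to a straight copy for big binaries.
    with open("payload.bin", "rb") as src, \
            gzip.open("payload.bin.gz", "wb", compresslevel=0) as dst:
        shutil.copyfileobj(src, dst)

    # Unwrap: reading to the end re-checks the CRC32, so silent corruption of
    # the archive raises an error instead of quietly producing a bad file.
    with gzip.open("payload.bin.gz", "rb") as src, \
            open("payload_restored.bin", "wb") as dst:
        shutil.copyfileobj(src, dst)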

Bert F
  • 85,407
  • 12
  • 106
  • 123
0

There are all sorts of reasons that binaries either won't get distributed or, if they do, arrive corrupted (firewalls, size limits, virus insertions, etc.). You should always encrypt files (even low-level encryption is better than none) when sending binary files, to help protect data integrity.

Roger
  • 1
0

Couldn't resist a back-of-envelope calculation:

There are 2^128 possible MD5 hashes, or c. 3.4 x 10^38 (that is, odds of 340 billion billion billion billion to 1 against). Let's call this number 'M'.

The probability of the Kth hash not matching any of the first K-1 (given those were all distinct) is (1 - (K-1)/M), as we already have K-1 unique hashes out of M.

So P(no duplicate in N file hashes) = Product[k = 1...N] (1 - (k-1)/M). When N^2 << M this approximates to 1 - (1/2)N^2/M, so P(one or more duplicates) ≈ (1/2)N^2/M, where (1/2)N^2 approximates the number of pair-wise comparisons of hashes that have to be made.

So let's say we take a photograph of EVERYONE ON THE PLANET (7.8 billion people, or a little under 2^33); then there are about 30.4 billion billion pair-wise comparisons to make (a little under 2^65).

This means that the chance of a matching MD5 hash (assuming perfectly even distribution) is still 2^65/2^128 = 2^-63 or 1 in 10,000,000,000,000,000,000.
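
The same arithmetic in a few lines of Python, as a sanity check of the numbers above (assuming perfectly uniform hashes):

    M = 2**128                  # possible MD5 values
    N = 7_800_000_000           # one file per person on the planet

    pairs = N * (N - 1) // 2    # ~3.0e19 pair-wise comparisons, a little under 2**65
    p_collision = pairs / M     # birthday approximation: N^2 / (2M)

    print(f"pairs:            {pairs:.3e}")         # ~3.042e+19
    print(f"P(any collision): {p_collision:.3e}")   # ~8.9e-20, i.e. about 1 in 10^19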

MD5 is a pretty decent hash function for non-hostile environments, which means the chance of your client having a false match is far smaller than, say, the chance of their CEO going crazy and burning down the data centre, let alone the things they actually worry about.

Duke Bouvier
  • 111
  • 4