9

This is a combinatorics question with some theory in hashing algorithms required.

Let's say the input can be any random sequence of bytes 30 kB to 5 MB of size (I guess that makes quite a few combinations of input values :))

What is the probability that the first 4 bytes (or first n bytes) of a MD5 hash computed from the byte sequence will be the same for distinct files?

In case this can not be computed specifically for MD5 hash, then what is the probability that any hash function that generates uniformly distributed m-byte hashes will compute hash with collision on first n bytes for given range of inputs?

Marek
  • 10,307
  • 8
  • 70
  • 106
  • Clarification: Please do not comment on security of MD5. The problem I am trying to solve is how to detect identical files, not security related. – Marek Nov 13 '09 at 09:22
  • The recent implementation of ZFS deduping led to some interesting insights; hash collisions did cause an interesting attack there. If you can know a file you just created is identical to another file, you effectively defeat _file system_ security. In fact, the file system _insecurity_ there was a direct consequence of SHA256 _security_ - if you had identical SHA256 hashes, you pretty much know you had identical files. – MSalters Nov 13 '09 at 12:39
  • 1
    Can you please clarify what you mean with bytes? When a hash starts with "E8:F0:D3:03:..." are the first four bytes "E 8 F 0" or "E8 F0 D3 03"? – tanascius Jul 19 '11 at 13:23
  • I'm surprised that none of the comments here mentioned the file header issue. If you're working with a common type like Excel, Word, AutoCAD, JPG, etc, they have known file headers that would likely drastically increase the collision rate. – Origin Nov 25 '12 at 01:29

6 Answers6

9

In absence of more information about the byte values probability, I would day it is 1 in 2^32.

EDIT. Indeed, 1 in 2^16 if you are taking the hexadecimal characters instead of the pure bytes.

EDIT based on comment:

Can MD5 be considered that uniform that a the computed values are absolutely random?

MD5 hash algorithm is designed so that a small change in the input results in a completely different hash, so I would say that MD5 hash bytes are distributed with equal probability (I would not bet anything on it anyway). Anyway you can apply a post-processing to your hash (for example you can use keyed MD5) to increase its randomness (and to make it more secure, by the way, since plain MD5 hashes have been proved to be insecure).

Konamiman
  • 49,681
  • 17
  • 108
  • 138
  • 1
    ?? Changing the representation from bytes to hexadecimal characters changes the probabilities. THAT does not compute. – High Performance Mark Nov 13 '09 at 08:27
  • 2
    Well, it depends on "the first 4 bytes" meaning the first real 4 bytes or the first 4 digits of the hex representation of the hash. – Konamiman Nov 13 '09 at 08:28
  • 1
    Thanks for clarification. Does your answer (and other that say 1/65536 is the probability) take into account birthday paradox? – Marek Nov 13 '09 at 09:21
  • 3
    @Marek: The birthday paradox applies to collections, not pairs. The chance given here is correct for a pair of files. Since we don't know how many files there are in total, the total chance is unknown. – MSalters Nov 13 '09 at 12:34
4

For an ideal hash function, the outputs are evenly distributed, so the chances of two colliding are one in 2^32. The birthday paradox, however, tells us that if we're comparing all pairs of hashes, we should expect to see a collision once we have 2^16 hashes, on average - so don't rely on only 4 bytes on the basis that "I have a lot less than 4 billion values".

MD5 isn't an ideal hash function, as we know, but the weaknesses are somewhat incidental here: finding a collision on 4 bytes is well within the realm of a reasonable brute-force attack, so there's no need to resort to cryptographic weaknesses. If you're only concerned about randomly picked data, you're not going to see a significant statistical deviation from randomness.

Nick Johnson
  • 100,655
  • 16
  • 128
  • 198
3

If you are interested in the odds of two particular inputs having the same 4-byte hash, then it's just 1/2^32. If you're interested in the odds of two inputs out of a set of X total inputs having the same odds, this stays fairly low until you start approaching 2^16 = 65536 distinct inputs in your set, where it reaches near 50% (this phenomenon is known as the Birthday Paradox).

In general, one of the criteria for a hash function to be cryptographically useful is the uniformity across all of the bits.

Yuliy
  • 17,381
  • 6
  • 41
  • 47
3

The odds of a collision in a n-bit hash are around 1 in 2^(n/2) due to the birthday paradox - so about 1 in 2^16 in this case. If for some reason you were referring to using 32 bits of the hex encoding, of course that would only be the first 16 actual bits, so then the odds of a collision would be about 1 in 2^8.

Given a specific fixed file, the odds that any other file chosen at random will have the same hash as that file is about 2^n. In terms of cryptographic hashes the difference between these is the first is a collision, the other is a preimage.

At this hash size, the weaknesses in MD5 are pretty irrelevant since the best known attacks on MD5 take roughly 2^32 computations, while one can generate a collision in even an ideally secure 32-bit hash in around 2^16 computations (since by merely choosing random inputs, you have 1 in 2^16 chance of a collision, so after roughly 2^16 random guesses you'll probably have found a colliding pair of inputs).

Jack Lloyd
  • 8,215
  • 2
  • 37
  • 47
  • +1. Good answer, the distinction between finding **a** collision (birthday paradox complexity) and **another file** that hashes to the same digest (second preimage) is important. – Kris Nov 13 '09 at 15:01
0

MD5 hashes are typically hexadecimals, so there are 16 possible values for each byte. Therefore for four bytes there are 16*16*16*16 = 65536 possible combinations, making the probability of a hash collision 1:65536.

Kaivosukeltaja
  • 15,541
  • 4
  • 40
  • 70
  • 4
    They are displayed to humans in hexadecimals, but are calculated as 4 DWORDS thus he is referring to the first DWORD. which is 4 bytes == 2^32 – Simeon Pilgrim Nov 13 '09 at 08:26
  • Thank you for quick answer. However, does this take into account specific properties of MD5? Can MD5 be considered that uniform that a the computed values are absolutely random? Isn't the collision probability larger specifically for MD5 as compared to ideal hashing function that is absolutely uniform? – Marek Nov 13 '09 at 08:30
  • Provided that the input is "absolutely random", it is reasonable to assume that output would be "absolutely random" as well. Hash functions are designed to be uniform, so in practice it is safe to assume uniformity, unless the input is specifically crafted to cause collisions with the given hash function. – VladV Nov 13 '09 at 08:42
-1

md5 is hexadecimal, so each character can be any one of 16 alleles. So that would make 16^n

For 4 characters, that makes 65536 different possible combinations.

David Hedlund
  • 128,221
  • 31
  • 203
  • 222