50

Given two different messages, A and B (maybe 20-80 characters of text, if size matters at all), what is the probability that the MD5 digest of A is the same as the MD5 digest of B and the SHA1 digest of A is the same as the SHA1 digest of B? That is:

(MD5(A) == MD5(B)) && (SHA1(A) == SHA1(B))

Assume no malicious intent, i.e., that the messages are not selected with an aim of finding a clash. I just want to know the odds of this happening naturally.

I'm thinking the chances are "astronomically low," but I'm not sure how to verify this.

More information: the size of the pool of possible messages is restricted, but large (several hundred million). Birthday paradox situations are exactly what I'm worried about.

Mechanical snail
  • 29,755
  • 14
  • 88
  • 113
John Siracusa
  • 14,971
  • 7
  • 42
  • 54
  • question doesn't really make sense. SHA-1 is stronger than MD5, so the probability of any collision is lower for SHA-1 anyway.... – Mitch Wheat Aug 24 '09 at 15:20
  • 6
    @Mitch and the probability that both will clash for a given message is less than the probability that either will clash. – Sinan Ünür Aug 24 '09 at 15:28
  • @Sinan Ünür: Obviously and your point was? – Mitch Wheat Aug 24 '09 at 15:59
  • 5
    I think his point is that the reason the OP is using both is to further reduce the chances of a collision - if there's a collision in MD5, the hope is that there wouldn't be a collision with SHA-1's different algorithm. No need to be snippy. – ceejayoz Aug 24 '09 at 16:10
  • Argh! Why do you change the nature of the question like that? – Welbog Aug 24 '09 at 17:35
  • See the edit in my answer for an analysis using the birthday paradox as a guide. – Welbog Aug 24 '09 at 18:00
  • 7
    If the goal is to reduce the chances of a collision, just use a hashing system that generates a larger digest. Like SHA-256, SHA-384, or SHA-512. – Fantius Aug 24 '09 at 18:12
  • 4
    @fantius The problem MAY be (depending on an application) that SHA-256, SHA-384, SHA-512 1) take longer to calculate, 2) the resulting hashes take more space than the concatenation of MD5 and SHA-1, and/or 3) The deployed system has hardware for MD5 and SHA-1 but not the others. This is a very valid question. – H. Green Feb 12 '10 at 19:02
  • A good way to further improve a digest, you might consider using one relatively expensive checksum (a SHA!), and a file size. The likelihood of an SHA hash conflict might occur on 2 files of the same size but different content is miniscule. – erik258 Dec 30 '13 at 21:01

5 Answers5

63

Assuming uniform spread in the range of MD5 and SHA-1 hashes for random strings (which isn't the case), and assuming we're only talking about two strings and not talking about a pool of strings (so we avoid birthday-paradox-type complexities):

An MD5 hash is 128 bits wide, and SHA-1's is 160. With the above assumptions, two strings A and B have a probability of colliding P if both hashes collide. So

P(both collide) = P(MD5 collides) * P(SHA-1 collides)

And

P(MD5 collides) = 1/(2^128)
P(SHA-1 collides) = 1/(2^160)

So

P(both) = 2^-128 * 2^-160 = 2^-288 ~= 2.01 x 10^-87

Again, if you have a pool of strings and you're trying to determine the probabilities of collisions with the pool, you're in the domain of the birthday paradox and this probability I've calculated here doesn't apply. That and hashes aren't as uniform as they should be. In reality you're going to have a much higher collision rate, but it will still be tiny.


EDIT

Since you are dealing with a birthday paradox situation, apply the same logic as the solution to the birthday paradox. Let's look at it from the point of view of just one hash function:

N := the number of hashes in your pool (several hundred million)
S := the size of your hash space (2^288)
Therefore,
P(There are no collisions) = (S!)/(S^N * (S - N)!)

Let's pretend we have a nice even number of hashes like 2^29 (roughly 530 million).

P = (2^288!)/(2^288^(2^29) * (2^288 - 2^29)!)

In short, I don't even want to think about calculating this number. I'm not even sure how you can go about estimating it. You'll at least need an arbitrary-precision calculator that can handle huge factorials without dying.

Note that the probabilities will follow a curve that starts at nearly 0 when N = 1 or 2, and it will reach 1 when N >= 2^288, similar in shape to the one on the Wikipedia page for the birthday paradox.

The birthday paradox reaches P = .5 when N = 23. In other words, the probability of a collision is 50% when N is 6% of S. If that scales (I'm not sure if it does), it means that there will be a 50% chance of a collision when you have 6% of 2^288 hashes. 6% of 2^288 is around 2^284. Your value of N (several hundred million) is nowhere near that. It's practically insignificant compared to your S, so I don't think you have anything to worry about. Collisions aren't very likely.

Welbog
  • 59,154
  • 9
  • 110
  • 123
  • 31
    There's one more assumption: that collisions in MD5 and SHA1 are independent. That is, that the two algorithms behave differently enough that a pair of strings that collide in MD5 are no more likely than usual to collide in SHA1. I think that's a safe assumption, even though the two algorithms have similar design. – Beta Aug 24 '09 at 15:33
  • 8
    Just to expand on Beta's statement. Welbog's analysis should be taken as a theoretical lower limit, the actual probability is guaranteed to be greater than or equal to that limit. Finding the actual true probability is cryptographically hard, you would actually have to fully crack both MD5 and SHA-1 to prove the actual probability. – Greg Miller Aug 24 '09 at 22:01
  • 4
    re: last paragraph: it doesn't scale linearly. Birthday paradox P=.5 goes roughly as sqrt(S), although off the top of my head, I can't find a reputable reference that states that. – Jason S Aug 24 '09 at 22:44
  • 1
    Even if it is sqrt(S), 2^29 is still insignificant compared to 2^144. But I accept that it's probably not linear. – Welbog Aug 24 '09 at 22:53
  • 1
    [http://www.wolframalpha.com/input/?i=2^(288!)/(2^(288^(2^29)) (2^288-2^29)!)](http://www.wolframalpha.com/input/?i=%282^288!%29%2F%282^288^%282^29%29+*+%282^288+-+2^29%29!%29) – Hello71 Jan 09 '11 at 23:58
  • 1
    @Beta not only "no more likely", but also "no less likely." – chacham15 Nov 07 '12 at 15:44
  • I think the Wolfram Alpha link above is missing some parentheses. Adding those, it seems even Wolfram Alpha cannot compute: https://www.wolframalpha.com/input/?i=%28%282%5E288%29!%29%2F%28%282%5E288%29%5E%282%5E29%29+*+%282%5E288+-+2%5E29%29!%29 except I can't figure out how to encode this link for SO. – ScottJ Jan 02 '13 at 23:00
  • The probabilities here appear to be off. According to [[1](https://www.schneier.com/blog/archives/2005/02/cryptanalysis_o.html)], SHA1 has a probability of 2^-69 for collision via bday attack. Similarly, the chance of collision for MD5 for "two blocks" is 2^-18 according to [[2](https://eprint.iacr.org/2013/170.pdf)]. These are vastly different numbers than what was used in Welbog's answer. Finally, it should be noted that this question is different but perhaps somewhat related to asking what the strength of a [combined MD5+SHA1 hash is](http://crypto.stackexchange.com/a/272) (hint: =~SHA1). – Greg Slepak Apr 11 '14 at 17:59
  • 1
    That is of course with way more than 2 messages (for bday attack). Going off of Welbog's math (if it's correct), then we need to figure out the number of messages needed for a 50% chance of collision. We take `S = 2^(69*2+18*2) = 2^174`, then 6% of that is 2^169 messages (as opposed to the 2^284), or 1.4e51 messages. At 1KB/message that's ~10^42 TB. That's still more data than we have hard disks (I think). – Greg Slepak Apr 11 '14 at 18:14
  • Playing some more with this, let's ignore disk space. Starting with 2^169 messages to look through, let's apply the current bitcoin network's hash rate of [46626.93 TH/sec](http://bitcoinwatch.com/) to this problem. Bitcoin uses SHA256, let's approximate this and assume the rate would be twice as high if they did SHA1. Then via wolframalpha we have [(2^169 / (2*46626.93 * 10^12)) seconds in years](http://www.wolframalpha.com/input/?i=%282%5E169+%2F+%282*46626.93+*+10%5E12%29%29+seconds+in+years) ~= 2.5e26 years to find a collision. So still secure (if I didn't mess up). – Greg Slepak Apr 11 '14 at 18:34
  • 1
    Now, that was using 6% of S. If we use sqrt of S that's 2^87 messages, several orders of magnitude different. Jason S asked for a reference, [here's one](https://wiki.engr.illinois.edu/download/attachments/183272958/crypto-hash.pdf?version=1&modificationDate=1315004095000). Let's say this is the accurate approximation and redo the math: [(2^87 / (2*46626.93 * 10^12)) seconds in years](http://www.wolframalpha.com/input/?i=%282%5E87+%2F+%282*46626.93+*+10%5E12%29%29+seconds+in+years) ~= 52 years for the birthday attack (50% of collision w/2^87 messages). – Greg Slepak Apr 11 '14 at 19:03
  • 1
    Given the above, it's clear that collisions can be found really quickly (way quicker than the 52 years shown above) by a super computer that produce collisions in both SHA1 and MD5 out of a universe of messages. To find a collision for a *specific* message, however, is orders of magnitude more difficult. – Greg Slepak Apr 11 '14 at 19:12
  • 1
    Looks like it's actually [2^61 for SHA1](http://code.google.com/p/hashclash/). So it's actually 2^61*2^18 = 2^79 messages. That places the actual answer (for *birthday attacks*) within a very reasonable time frame (couple months and shrinking every year) for super computers. – Greg Slepak Apr 12 '14 at 03:44
7

an addendum to Welbog's post:

Ratios of large factorials can be computed without using arbitrary-precision arithmetic, by using Stirling's approximation:

n! ≈ sqrt(2πn) * (n/e)n

So (S!)/(S^N * (S - N)!) ≈ sqrt(2πS)/sqrt(2π(S-N))*(S/e)S/((S-N)/e)S-N/SN

= sqrt(S/(S-N)) * (S/(S-N))S-N * e-N

= sqrt(1 + α) * (1 + α)S-N * e-N where α = N/(S-N) is small.

The approximation (1+a/n)nx ≈ eax holds as n → ∞ (or at least becomes very large)

** so this means (1+(N/(S-N)))S-N ≈ eN for S-N >> N.

So I would expect that

(S!)/(S^N * (S - N)!) ≈ sqrt(1 + N/(S-N)) * eN * e-N = sqrt(1 + N/(S-N)) for S-N >> N....

except this is greater than 1... so one of the approximations isn't good enough. :p

(** caveat: N/S has to be small: for N=22,S=365 this is off by a factor of 2)

Jason S
  • 184,598
  • 164
  • 608
  • 970
4

If message size is not restricted, the chance approaches 100% asymptotically, as there's an infinite number of possible messages and a finite number of possible hashes.

(note: edit to question makes this less relevant now)

ceejayoz
  • 176,543
  • 40
  • 303
  • 368
  • No. No matter how large the message is, it still hashes to a single MD5+SHA1 hash. – Captain Segfault Aug 24 '09 at 17:46
  • You're missing the point. There are a limited number of possible hashes, as they're finite length. There are an unlimited number of messages. Infinite messages plus finite hashes means infinite collisions. – ceejayoz Aug 24 '09 at 17:48
  • I think ceejayoz is missing the point, actually. In the question, it says, "More information: the size of the pool of possible messages is restricted, but large (several hundred million)." That is not the same as infinite. – Fantius Aug 24 '09 at 18:10
  • 1
    @fantius Question got edited. I even put a note in this answer pointing that out 19 minutes before you commented. – ceejayoz Aug 24 '09 at 18:12
  • OK, sorry. I was going based on the time of your last comment, which was after the time of the question edit. – Fantius Aug 24 '09 at 18:17
  • 1
    ceejayoz, I think this confusion arose because you said "message size" when you meant "the number of messages". – Beta Aug 24 '09 at 19:37
  • 2
    If message size is not limited, the number of messages is as a result not limited, as I can do "message", "messagemessage", "messagemessagemessage", and so on. Infinite message size logically leads to infinite number of messages. – ceejayoz Aug 24 '09 at 19:51
1

Generally, when one picks N elements randomly it is easier to compute the expected number of collision than the probability of a collision. Since the expected number of collisions cannot be smaller than the probability of a collision it can frequently be used as a suitable upper bound.

Assume that p is the probability that two randomly picked elements collide. If we pick N random elements then there are N*(N-1)/2 pair of elements and hence the expected number of collisions is

p * N * (N-1)/2.

E.g., if we assume that the probability for a collision for both MD5 and SHA1 is p=2-288 then even after randomly picking 2100 elements we still only expect about 2-89 collisions.

Another example: if we pick 230 random elements and only compute the MD5. Assuming that a collision between two MD5 hashes is p=2-128 this gives an expected number of 2-59 for the number of collisions. Hence even the probability that the MD5 hash collides for two inputs is already very small.

Accipitridae
  • 3,136
  • 19
  • 9
1

The chosen answer is incorrect because it uses the wrong probabilities. I spent a good portion of today researching this (you can sorta see my thought process in the comments to that answer), and believe the actual answer is the following (for birthday attack of slightly larger messages than the ones you're talking about):

2^-61 * 2^-18 = a collision in once in 2^79.

And that's if it's OK to just multiply these probabilities (I'm unsure of that).

This is doable (less than a couple months and dropping every year) by super computers today.

Note that this is based on sufficiently large pools of messages (to make the birthday paradox meaningful). This is also the scenario you said you were concerned about.

Now, a different situation is finding a collision for a pair of hashes (SHA1 and MD5) of a specific message. This takes you out of bday paradox territory and is orders of magnitude more difficult. I'm not sure if that is 2^(-61*2)*2^(-18*2) or something else. If anyone knows what that is, please post a comment to this answer (would be super appreciated!).

Now you ask:

Given two different messages, A and B (maybe 20-80 characters of text, if size matters at all)

Yes, size does matter. Click the link to the 2^-18 figure and you'll see that value is for two input blocks. In MD5, an input block is 512 bytes. 20-80 characters of text is too small for that, and the single-block value is 2^41.

Thus, for that amount of data, you get 2^-61 (I think) * 2^-41 = 2^-102.

So for that size it seems safe (link contains figure of twice current bitcoin hashrate of SHA256: 46626.93 TH/sec).

Greg Slepak
  • 1,613
  • 16
  • 17