Example of hash collision (printable strings)

Question

I've searched a lot for MD5 hash collision, but I've found binary examples only. I would like to find two UTF-8 strings, which have the same MD5 hash. Are there any, or does the collision only work for binary data?

There surely are, whether you would want to spent the time finding them is a different question. You should just take into account that 2**128 is a quiet large number. Reference: http://en.wikipedia.org/wiki/MD5 — alk, Oct 06 '14 at 13:29
Possible duplicate of http://stackoverflow.com/q/1999824/3194340 — n0p, Oct 13 '14 at 14:33
This question is being discussed on [meta](https://meta.stackoverflow.com/questions/412595). — cigien, Oct 27 '21 at 14:57
There's a whole question about this on [crypto.stackexchange.com](https://crypto.stackexchange.com/questions/1434/are-there-two-known-strings-which-have-the-same-md5-hash-value) which is a much more appropriate location for this kind of question. — Martin, Oct 28 '21 at 19:07
@Martin The answers provide binary examples only, not real UTF8 strings — Iter Ator, Oct 28 '21 at 19:31
@IterAtor The point being not that the question was a duplicate, but rather that it belongs on that site instead of this one — Martin, Oct 28 '21 at 20:26
@IterAtor still not a programming question. Did you try to write code to convert that binary data to UTF8 strings? If you did and have problems making that work, then you should write a new question with that code so we can help. — Leonardo Herrera, Oct 29 '21 at 14:34
This question is being discussed on [meta](https://meta.stackoverflow.com/q/412697/16886597), a second time — cocomac, Nov 01 '21 at 02:20

score 15 · Answer 1 · edited Nov 04 '21 at 02:09

It's definitely possible:

We all agree there are collisions for MD5 due to the birthday paradox - we are mapping infinitely many possible inputs to elements belonging to a finite sequence.
There is a solid chance that there are infinitely many collisions: we are able to produce infinite pairs of input and MD5 tries to map them uniformly.

By that alone some of these collisions are bound to be valid UTF-8 strings, but they're extremely rare, since most of these will be just random binary garbage.

If you absolutely need to find such messages, I recommend using collision finder written by Patrick Stach, which should return pair of arbitrary messages within a few hours, or my attempt to improve it. The latter uses techniques presented in later papers by Wang (the first person to demonstrate examples of MD5 collisions), Lian, Sasaki, Yajima and Klima.

I think you could also use length extension attack to some extent, but it requires deeper understanding of what happens inside MD5.

Is it really the "birthday paradox"? I think it should be the [pigeonhole principle](https://en.wikipedia.org/wiki/Pigeonhole_principle) (which is also linked from the birthday paradox wiki page). — Eric Duminil, Nov 02 '21 at 08:01

score 9 · Answer 2 · answered Oct 06 '14 at 19:12

9

There are UTF-8 collisions. By the nature of cryptographic hashes, finding them is intentionally difficult, even for a hash as broken as MD5.

You might search for MD5 Rainbow Tables, which can be used for password cracking, and hence for UTF-8 strings. As @alk pointed out, a brute force search is going to take a very long time.

answered Oct 06 '14 at 19:12

rossum

15,344
1
24
38

Whoever is bumping up the score for this answer, please don't. It is old, and deserves to be left in peace. – rossum Nov 01 '21 at 23:19

score -5 · Answer 3 · answered Oct 14 '14 at 06:35

The canonical example of an MD5 hash collision (hex - from here):

Message 1:

d131dd02c5e6eec4693d9a0698aff95c 2fcab58712467eab4004583eb8fb7f89
55ad340609f4b30283e488832571415a 085125e8f7cdc99fd91dbdf280373c5b
d8823e3156348f5bae6dacd436c919c6 dd53e2b487da03fd02396306d248cda0
e99f33420f577ee8ce54b67080a80d1e c69821bcb6a8839396f9652b6ff72a70

Message 2

d131dd02c5e6eec4693d9a0698aff95c 2fcab50712467eab4004583eb8fb7f89
55ad340609f4b30283e4888325f1415a 085125e8f7cdc99fd91dbd7280373c5b
d8823e3156348f5bae6dacd436c919c6 dd53e23487da03fd02396306d248cda0
e99f33420f577ee8ce54b67080280d1e c69821bcb6a8839396f965ab6ff72a70

are, in fact, valid UTF-8 strings. They do not contain any NULL bytes and are therefore UTF-8 strings. Now, they are meaningless and look like garbage when decoded:

Message 1:

1i=\/ʵF~@X>U4 䈃%qAZQ%ɟ7<[؂>1V4[m6Sⴇ9cH͠3BW~Tp
Ƙ!e+o*p

(some characters were control characters)

Message 2:

1i=\/ʵF~@X>U4    䈃%AZQ%ɟr7<[؂>1V4[m6S49cH͠3BW~Tp(
Ƙ!eo*p

(same situation)

Oh, and before I forget, here's the MD5 hash:

79054025255fb1a26e4bc422aef54eb4

I don't believe that site will accept hex encoded data. And the UTF string isn't exactly perfectly printable. Try doing it using `xxd` and `md5sum` in a *nix shell terminal pipeline. — haneefmubarak, Oct 14 '14 at 07:46
These are obviously not valid UTF-8. I assume your UTF-8 decoder accepts invalid byte sequences. `31 dd 02` can't be valid, since UTF-8 will never produce a single byte with the high bit set surrounded by bytes with the high bit unset. — CodesInChaos, Oct 14 '14 at 09:29

Example of hash collision (printable strings)

3 Answers3

Message 1:

Message 2

Message 1:

Message 2: