6

I've searched a lot for MD5 hash collision, but I've found binary examples only. I would like to find two UTF-8 strings, which have the same MD5 hash. Are there any, or does the collision only work for binary data?

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Iter Ator
  • 8,226
  • 20
  • 73
  • 164
  • 8
    There surely are, whether you would want to spent the time finding them is a different question. You should just take into account that 2**128 is a quiet large number. Reference: http://en.wikipedia.org/wiki/MD5 – alk Oct 06 '14 at 13:29
  • 2
    Possible duplicate of http://stackoverflow.com/q/1999824/3194340 – n0p Oct 13 '14 at 14:33
  • 20
    This question is being discussed on [meta](https://meta.stackoverflow.com/questions/412595). – cigien Oct 27 '21 at 14:57
  • 6
    There's a whole question about this on [crypto.stackexchange.com](https://crypto.stackexchange.com/questions/1434/are-there-two-known-strings-which-have-the-same-md5-hash-value) which is a much more appropriate location for this kind of question. – Martin Oct 28 '21 at 19:07
  • 1
    @Martin The answers provide binary examples only, not real UTF8 strings – Iter Ator Oct 28 '21 at 19:31
  • 6
    @IterAtor The point being not that the question was a duplicate, but rather that it belongs on that site instead of this one – Martin Oct 28 '21 at 20:26
  • 10
    @IterAtor still not a programming question. Did you try to write code to convert that binary data to UTF8 strings? If you did and have problems making that work, then you should write a new question with that code so we can help. – Leonardo Herrera Oct 29 '21 at 14:34
  • 2
    This question is being discussed on [meta](https://meta.stackoverflow.com/q/412697/16886597), a second time – cocomac Nov 01 '21 at 02:20

3 Answers3

15

It's definitely possible:

  • We all agree there are collisions for MD5 due to the birthday paradox - we are mapping infinitely many possible inputs to elements belonging to a finite sequence.
  • There is a solid chance that there are infinitely many collisions: we are able to produce infinite pairs of input and MD5 tries to map them uniformly.

By that alone some of these collisions are bound to be valid UTF-8 strings, but they're extremely rare, since most of these will be just random binary garbage.

If you absolutely need to find such messages, I recommend using collision finder written by Patrick Stach, which should return pair of arbitrary messages within a few hours, or my attempt to improve it. The latter uses techniques presented in later papers by Wang (the first person to demonstrate examples of MD5 collisions), Lian, Sasaki, Yajima and Klima.

I think you could also use length extension attack to some extent, but it requires deeper understanding of what happens inside MD5.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
rr-
  • 14,303
  • 6
  • 45
  • 67
  • 1
    Is it really the "birthday paradox"? I think it should be the [pigeonhole principle](https://en.wikipedia.org/wiki/Pigeonhole_principle) (which is also linked from the birthday paradox wiki page). – Eric Duminil Nov 02 '21 at 08:01
9

There are UTF-8 collisions. By the nature of cryptographic hashes, finding them is intentionally difficult, even for a hash as broken as MD5.

You might search for MD5 Rainbow Tables, which can be used for password cracking, and hence for UTF-8 strings. As @alk pointed out, a brute force search is going to take a very long time.

rossum
  • 15,344
  • 1
  • 24
  • 38
  • Whoever is bumping up the score for this answer, please don't. It is old, and deserves to be left in peace. – rossum Nov 01 '21 at 23:19
-5

The canonical example of an MD5 hash collision (hex - from here):

Message 1:

d131dd02c5e6eec4693d9a0698aff95c 2fcab58712467eab4004583eb8fb7f89
55ad340609f4b30283e488832571415a 085125e8f7cdc99fd91dbdf280373c5b
d8823e3156348f5bae6dacd436c919c6 dd53e2b487da03fd02396306d248cda0
e99f33420f577ee8ce54b67080a80d1e c69821bcb6a8839396f9652b6ff72a70

Message 2

d131dd02c5e6eec4693d9a0698aff95c 2fcab50712467eab4004583eb8fb7f89
55ad340609f4b30283e4888325f1415a 085125e8f7cdc99fd91dbd7280373c5b
d8823e3156348f5bae6dacd436c919c6 dd53e23487da03fd02396306d248cda0
e99f33420f577ee8ce54b67080280d1e c69821bcb6a8839396f965ab6ff72a70

are, in fact, valid UTF-8 strings. They do not contain any NULL bytes and are therefore UTF-8 strings. Now, they are meaningless and look like garbage when decoded:

Message 1:

1i=\/ʵF~@X>U4 䈃%qAZQ%ɟ7<[؂>1V4[m6Sⴇ9cH͠3BW~Tp
Ƙ!e+o*p

(some characters were control characters)

Message 2:

1i=\/ʵF~@X>U4    䈃%AZQ%ɟr7<[؂>1V4[m6S49cH͠3BW~Tp(
Ƙ!eo*p

(same situation)


Oh, and before I forget, here's the MD5 hash:

79054025255fb1a26e4bc422aef54eb4
haneefmubarak
  • 1,911
  • 1
  • 21
  • 32
  • I don't believe that site will accept hex encoded data. And the UTF string isn't exactly perfectly printable. Try doing it using `xxd` and `md5sum` in a *nix shell terminal pipeline. – haneefmubarak Oct 14 '14 at 07:46
  • 14
    These are obviously not valid UTF-8. I assume your UTF-8 decoder accepts invalid byte sequences. `31 dd 02` can't be valid, since UTF-8 will never produce a single byte with the high bit set surrounded by bytes with the high bit unset. – CodesInChaos Oct 14 '14 at 09:29