I have two different files, code1.c and code2.c. Both already contain some C code, and their contents differ.

I would like to compile both files, preferably with the same compilation flags, so that I end up with two binaries that have the same size and the same MD5 hash.

Note: Adding extra dead/junk code to one of the files is allowed. I assume the gcc version should not matter? (I would use gcc version ≥ 7.)

How can I achieve that?


I found some articles which show that md5 hash collision is possible, but the problem is that it should result in the same file size:

Awaaaaarghhh
  • Although it is conceivable that there is a solution, it is highly unlikely that you can find one in practice. It is hard enough to find a hash collision when you can modify one input arbitrarily -- that's a main point of hashing -- but you cannot even do that, since the inputs to your MD5 are outputs from the compiler. It is in fact possible that there is no solution given your starting point and constraints. – John Bollinger Apr 24 '20 at 23:41
  • @JohnBollinger would it be easier if I compiled without any optimisation? – Awaaaaarghhh Apr 24 '20 at 23:45
  • No, not particularly. – John Bollinger Apr 24 '20 at 23:46
  • @JohnBollinger and what if only the MD5 hash has to be the same (compiling without optimisation, file size may differ)? Is this then doable? E.g. given the `md5 hash` of file `binary1` and source code `code2.c`, the compiled `code2.c` (`binary2`) should have the same `md5 hash` as file `binary1`. – Awaaaaarghhh Apr 25 '20 at 00:15
  • The second link you posted is *exactly* what you are looking for. – Marco Bonelli Apr 25 '20 at 00:56
  • @JohnBollinger you can add dead code to the already compiled binaries (preferably near the start of the file), that would definitely grant a solution, possibly even quite rapidly. – Marco Bonelli Apr 25 '20 at 00:58
  • @MarcoBonelli, that does not seem to be within the constraints outlined by the OP. I read the OP to be saying that the binaries *as emitted by the compiler* must have the same hash (not to mention being the same size). I take the comment about adding junk to one of the files as referring to the source files, not the resulting binaries. – John Bollinger Apr 25 '20 at 01:04
  • @MarcoBonelli I would like the `md5-file-hash` of the compiled `code2.c` (`binary2`) to be exactly the same as the `md5-file-hash` of `binary1`. It is not clear to me how I can adapt the solution from the *second link* (natmchugh.blogspot.com) to my question. Can you give me a hint, or better, write a full answer? – Awaaaaarghhh Apr 25 '20 at 01:04
  • @JohnBollinger the resulting binary may also have some junk added to it (just some `NOP` opcodes or something similar); if it works, that would be OK too. – Awaaaaarghhh Apr 25 '20 at 01:06
  • John Bollinger is right that your question seemed to say you could only change *the source code* before compiling. That would probably make things infeasible. The only way to make this work is to alter the binaries after compilation, and as I said, the second link you posted does just this: it adds a useless variable to the code, compiles it, precalculates the MD5 state up to the variable, and then brute-forces the value of the variable needed to produce a collision. – Marco Bonelli Apr 25 '20 at 01:18
  • How about adding big chunks like this to your functions? `asm volatile("jmp foo\n#a bunch of asm-encoding of arbitrary binary stuff here\nfoo:": : :)` That gives you pretty much raw access to the middle of the binary, from the source. – Joseph Sible-Reinstate Monica Apr 25 '20 at 04:39
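
To make the "useless variable" idea from the comments a bit more concrete, here is a minimal sketch in C. It only covers the bookkeeping part: reserving a recognisable region in the source and working out where a post-build tool could overwrite it on an MD5 block boundary. It does not search for a collision. All names (`md5_pad`, `PAD_TAG`, `find_pad`, `align_up_block`, `window_fits`) and the sizes are hypothetical. The 128-byte window assumes a classic two-block collision; other collision tools may need more room. In a real setup the array would live in `code2.c` and the helpers in a separate patching tool, since otherwise the tag string appears twice in the binary.

```c
#include <stddef.h>
#include <string.h>

#define MD5_BLOCK     64   /* MD5 consumes its input in 64-byte blocks */
#define COLLISION_LEN 128  /* assumption: two-block collision payload */
#define PAD_SIZE      192  /* large enough that an aligned 128-byte window
                              fits at any file offset (worst case: 63 bytes
                              of slack before the next block boundary) */

/* Hypothetical 8-byte tag used to locate the region in the built binary. */
static const char PAD_TAG[] = "MD5PAD__";

/* Reserved region. It is initialised with non-zero data so that it is
 * stored in the file rather than in .bss; volatile is a hint to keep the
 * compiler from making assumptions about its contents. */
volatile unsigned char md5_pad[PAD_SIZE] = "MD5PAD__";

/* Return the offset of the first occurrence of PAD_TAG in image, or -1. */
long find_pad(const unsigned char *image, size_t len)
{
    size_t tag_len = sizeof PAD_TAG - 1;
    for (size_t i = 0; i + tag_len <= len; i++) {
        if (memcmp(image + i, PAD_TAG, tag_len) == 0)
            return (long)i;
    }
    return -1;
}

/* Round a file offset up to the next multiple of the MD5 block size. */
size_t align_up_block(size_t off)
{
    return (off + MD5_BLOCK - 1) / MD5_BLOCK * MD5_BLOCK;
}

/* Return 1 if a block-aligned window of COLLISION_LEN bytes lies entirely
 * inside a pad of PAD_SIZE bytes that starts at file offset pad_off. */
int window_fits(size_t pad_off)
{
    size_t start = align_up_block(pad_off);
    return start + COLLISION_LEN <= pad_off + PAD_SIZE;
}
```

A patching tool would read the compiled binary into memory, call `find_pad` to get the pad offset, and use `align_up_block` to pick the first block boundary inside it. Everything before that boundary is the prefix whose MD5 state can be precomputed. Note that this by itself only helps when both binaries share the same bytes up to that point; for two genuinely different programs the prefixes differ, which is a harder variant of the problem.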
