I'm struggling with a difference in MD5 result consumption between Kotlin (Java) and C#. I've found this article that suggests a solution:

How can you generate the same MD5 Hashcode in C# and Java?

But I would like to understand the logic behind this. I've done a couple of tests. C#:

var data = MD5.Create().ComputeHash(Encoding.UTF8.GetBytes("123456"));
var s = Encoding.UTF8.GetString(data, 0, data.Length);

This produces the following byte sequence (the data variable):

 225, 10, 220, 57, 73, 186, 89, 171, 190, 86, 224, 87, 242, 15, 136, 62

If I use Kotlin (Java):

val md = MessageDigest.getInstance("MD5")
val data = md.digest("123456".toByteArray())

val s = String(data)

val ls2 = data.map { x-> x.toUByte() }

So Java has signed bytes and C# has unsigned ones (ls2 contains the same unsigned bytes as the C# example). Fine. I would like to get a string value: I convert both byte arrays to strings and I get different strings (the s variable). What am I missing?

Thanks.

Pavel

1 Answer

In C#, you try to use UTF-8 encoding to turn your bytes into a string. However, this is a very bad idea -- there are many sequences of bytes which aren't valid in a UTF-8-encoded string, and further sequences which will result in unprintable characters. If the decoder encounters a sequence of bytes which doesn't form a valid UTF-8-encoded character (and it will do, because you're not doing anything to ensure that your sequence of bytes is a valid UTF-8-encoded string), it will insert a replacement character.

In Kotlin, you use String(data). Unlike Java's new String(byte[]), which uses your system's default encoding, Kotlin's String(ByteArray) decodes the bytes as UTF-8. You've got the same problem here: bytes which don't form a valid character get replaced, and some of the valid characters will be unprintable.

So both languages are decoding arbitrary bytes as text, and the two decoders aren't guaranteed to handle invalid sequences in the same way (hence different results). You're also doing something which will probably give you unprintable characters, or might replace sequences of bytes with a replacement character (so different MD5 hashes will look the same).

(Note that "unprintable characters" might just not be visible, but they might do strange things like reverse the direction of text on that page, or start joining together the characters around them!)
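To see the information loss for yourself, here is a minimal Kotlin sketch. The helper name utf8RoundTrip is made up for this illustration, and the charset is passed explicitly so the result doesn't depend on any defaults. It decodes the digest as UTF-8, encodes the text back to bytes, and checks whether the original bytes survive:

```kotlin
import java.security.MessageDigest

// Hypothetical helper: decode bytes as UTF-8, then encode the text back to bytes.
fun utf8RoundTrip(bytes: ByteArray): ByteArray =
    String(bytes, Charsets.UTF_8).toByteArray(Charsets.UTF_8)

fun main() {
    val digest = MessageDigest.getInstance("MD5")
        .digest("123456".toByteArray(Charsets.UTF_8))

    // The digest starts with 0xE1 0x0A, which is not a valid UTF-8 sequence,
    // so the decoder substitutes U+FFFD and the original bytes are lost.
    println(utf8RoundTrip(digest).contentEquals(digest)) // false

    // Two different invalid inputs collapse into the same one-character string.
    val a = String(byteArrayOf(0xE1.toByte()), Charsets.UTF_8)
    val b = String(byteArrayOf(0xDC.toByte()), Charsets.UTF_8)
    println(a == b) // true
}
```

The second check is the "different hashes will look the same" problem in miniature: once bytes have been replaced, you can't tell which bytes were there originally.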

You'd be better off turning your bytes into a base64 string, or a sequence of hex characters. Both of these make sure that every possible sequence of bytes gets turned into printable characters, in a way which is consistent across different languages.

For C#, use Convert.ToBase64String(data) to get a base64-encoded string, and BitConverter.ToString(data).Replace("-","") to get a hex-encoded string (note that this gives upper-case hex digits; there are many other ways to do this).

For Kotlin, use Base64.getEncoder().encodeToString(data) to get a base64-encoded string, and data.joinToString("") { "%02x".format(it) } to get a hex-encoded string.
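Putting the Kotlin side together, a small self-contained sketch might look like this. The function names md5, toHex and toBase64 are just illustrative wrappers around the calls mentioned above, and the input is encoded as UTF-8 explicitly to match the C# code in the question:

```kotlin
import java.security.MessageDigest
import java.util.Base64

// Illustrative wrapper: MD5 digest of the UTF-8 bytes of the text.
fun md5(text: String): ByteArray =
    MessageDigest.getInstance("MD5").digest(text.toByteArray(Charsets.UTF_8))

// Two lower-case hex digits per byte; %02x treats a negative Byte as unsigned.
fun toHex(bytes: ByteArray): String =
    bytes.joinToString("") { "%02x".format(it) }

// Standard base64 with padding.
fun toBase64(bytes: ByteArray): String =
    Base64.getEncoder().encodeToString(bytes)

fun main() {
    val digest = md5("123456")
    println(toHex(digest))    // e10adc3949ba59abbe56e057f20f883e
    println(toBase64(digest)) // 4QrcOUm6Wau+VuBX8g+IPg==
}
```

The hex output corresponds byte for byte to the sequence in the question (225 is e1, 10 is 0a, and so on), so the signed/unsigned difference doesn't matter once the bytes are encoded this way.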

canton7
    "You cannot take any old byte sequence and turn it into a string." - yes you can, you just can't treat it *as a string in an arbitrary encoding*. Converting it into base64 is still "turning it into a string". Without the notion of an encoding, there's no concept of "a valid string". (And indeed any byte sequence *is* a valid ISO-8859-1 string.) I wholeheartedly agree with the thrust of the answer, but can't in good conscience upvote it until the first paragraph is sorted :) – Jon Skeet Oct 21 '19 at 14:22
  • @JonSkeet Hopefully that's a bit more correct, thanks for the feedback – canton7 Oct 21 '19 at 14:31