I'm porting a C# script into Spark (Scala) and I'm running into an issue with UUID generation in Scala vs GUID generation in C#.

Is there any way to generate a UUID in Java that is identical to the one generated in C#?

I'm generating the primary key for a database by creating a Guid from the MD5 hash of a string. Ultimately, I'd like to generate UUIDs in Java/Scala that match those from the C# script, so the existing data in the database that used the C# implementation for hashing doesn't need to be rehashed.

C# to port:

String ex = "Hello World";
Console.WriteLine("String to Hash: {0}", ex);
byte[] md5 = GetMD5Hash(ex);
Console.WriteLine("Hash: {0}", BitConverter.ToString(md5));
Guid guid = new Guid(md5);
Console.WriteLine("Guid: {0}", guid);

private static byte[] GetMD5Hash(string s) {
  using (MD5 md5 = MD5.Create())
    return md5.ComputeHash(Encoding.UTF8.GetBytes(s));
}

Scala ported code:

import java.security.MessageDigest
import java.util.UUID

val to_encode = "Hello World"
val md5hash = MessageDigest.getInstance("MD5")
 .digest(to_encode.trim().getBytes("UTF-8"))
val md5string = md5hash.map("%02x".format(_)).mkString
val uuid_bytes = UUID.nameUUIDFromBytes(to_encode.trim().getBytes("UTF-8"))
printf("String to encode: %s\n", to_encode)
printf("MD5: %s\n", md5string)
printf("UUID: %s\n", uuid_bytes.toString)

Result from C#

  • String to hash: Hello World
  • MD5: B1-0A-8D-B1-64-E0-75-41-05-B7-A9-9B-E7-2E-3F-E5
  • Guid: b18d0ab1-e064-4175-05b7-a99be72e3fe5

Result from Scala

  • String to hash: Hello World
  • MD5: b10a8db164e0754105b7a99be72e3fe5
  • UUID: b10a8db1-64e0-3541-85b7-a99be72e3fe5

What works:

  • MD5 hashes (which the GUID and UUID are based on) match

What doesn't:

  • First three fields have endianness switched in C# (b18d0ab1-e064-4175 vs. b10a8db1-64e0-7541 in the hash)
    • C#'s GUID uses native byte ordering for the first three fields (4, 2, and 2 bytes), which in this case is little-endian, and big-endian ordering for the remaining 8 bytes, while Java's UUID uses big-endian ordering throughout; this explains the byte ordering in the first three fields in C#.
  • Bytes 6 and 8 (zero-indexed) are different in Java (75 and 05 in the hash become 35 and 85 in the UUID)
    • Java overwrites 6 bits to denote the version (top four bits of byte 6) and variant (top two bits of byte 8) of the UUID, which would explain these differences. This seems to be the roadblock.
  • I understand that Java uses signed bytes, while C# has unsigned bytes; this might be relevant as well.
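
To make the version and variant point concrete, here is a minimal Java sketch that applies the same bit masks by hand to the MD5 digest and prints the result next to what UUID.nameUUIDFromBytes returns. The class and method names (VersionBitsDemo, md5WithVersionBits) are made up for illustration and are not part of the original code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

// Hypothetical demo class; the names here are illustrative only.
final class VersionBitsDemo {

    // Hashes the input with MD5, overwrites the version and variant bits,
    // and formats the 16 bytes in big-endian order as a UUID string.
    static String md5WithVersionBits(String input) {
        byte[] h;
        try {
            h = MessageDigest.getInstance("MD5")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
        h[6] = (byte) ((h[6] & 0x0f) | 0x30); // top four bits of byte 6 -> version 3
        h[8] = (byte) ((h[8] & 0x3f) | 0x80); // top two bits of byte 8 -> variant "10"
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < h.length; i++) {
            if (i == 4 || i == 6 || i == 8 || i == 10) {
                sb.append('-');
            }
            sb.append(String.format("%02x", h[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "Hello World";
        // Both lines print the same UUID string.
        System.out.println(md5WithVersionBits(s));
        System.out.println(UUID.nameUUIDFromBytes(s.getBytes(StandardCharsets.UTF_8)));
    }
}
```

For "Hello World" both lines print b10a8db1-64e0-3541-85b7-a99be72e3fe5, the Scala result above. The signed-versus-unsigned byte difference plays no part here, because the masks operate on bit patterns.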

Short of manipulating bytes, is there any other way to fix this?

Ari Krumbein
  • @JoeC Read the full question. I thought the same when I first read the title, but the entire question makes sense: he's constructing the GUID from an MD5 hash. – Gusman Jul 26 '17 at 21:26
  • as an aside, databases generally don't do well with UUID primary keys unless they are specifically sequential. – Crowcoder Jul 26 '17 at 21:28
  • That is not universally true @Crowcoder . Cassandra would likely be a counter-example. – mjwills Jul 26 '17 at 22:09
  • @mjwills hence "generally". OP does not specify if the db is relational. – Crowcoder Jul 26 '17 at 22:13
  • Are you sure it's `UUID.nameUUIDFromBytes(to_encode.trim().getBytes())` in the Scala example? In the C# example you used the hash as an input for the Guid. – Bernhard Hiller Jul 27 '17 at 10:32
  • On the C# side, the relevant code calls the custom Guid constructor that uses a byte[], in this case, the MD5 hash. Since I wanted to emulate the behavior on the Scala side, I figured a version 3 UUID that used MD5 hashing would be the closest thing. Do you have a different way of going about this? – Ari Krumbein Jul 27 '17 at 16:33
  • That works! Thanks. Do I upvote this comment or the one below? – Ari Krumbein Jul 28 '17 at 17:54
  • I will update my answer to include it @AriKrumbein - feel free to upvote it and accept it. – mjwills Jul 28 '17 at 23:37

1 Answer

TL;DR

If you want your C# and your Java to act exactly the same way (and you are happy with the existing C# behaviour), you'll need to manually re-order some of the hash bytes before building the UUID (i.e. reverse the first four bytes, the next two, and the two after that, which are the fields you identified as out of order).

Additionally, you should not use:

UUID.nameUUIDFromBytes(to_encode.trim().getBytes())

But instead use:

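import java.nio.ByteBuffer;
import java.util.UUID;

// Builds a UUID from 16 raw bytes in big-endian order, without touching
// the version or variant bits.
public static String getGuidFromByteArray(byte[] bytes) {
    ByteBuffer bb = ByteBuffer.wrap(bytes);
    long high = bb.getLong();
    long low = bb.getLong();
    UUID uuid = new UUID(high, low);
    return uuid.toString();
}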

Shamelessly stolen from https://stackoverflow.com/a/24409153/34092 :)
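
Putting the two pieces together (byte re-ordering plus building the UUID from the raw bytes), a minimal sketch could look like the following. The class and method names (DotNetGuid, fromDotNetBytes, guidFromMd5) are hypothetical, and the code assumes the C# side always treats the first three fields as little-endian, which is what the output in the question shows.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

// Hypothetical helper class; names are illustrative, not an existing API.
final class DotNetGuid {

    // Mimics C#'s new Guid(byte[]) as observed in the question: the first
    // 4-byte field and the two following 2-byte fields are byte-reversed,
    // the last 8 bytes are kept as they are.
    static UUID fromDotNetBytes(byte[] bytes) {
        if (bytes.length != 16) {
            throw new IllegalArgumentException("expected 16 bytes");
        }
        byte[] b = bytes.clone(); // leave the caller's array untouched
        reverse(b, 0, 4);
        reverse(b, 4, 2);
        reverse(b, 6, 2);
        ByteBuffer bb = ByteBuffer.wrap(b);
        long high = bb.getLong();
        long low = bb.getLong();
        return new UUID(high, low);
    }

    // Reverses `length` bytes of `a` in place, starting at `offset`.
    private static void reverse(byte[] a, int offset, int length) {
        for (int i = offset, j = offset + length - 1; i < j; i++, j--) {
            byte t = a[i];
            a[i] = a[j];
            a[j] = t;
        }
    }

    // MD5-hashes the UTF-8 bytes of the input and builds the Guid-style UUID.
    static UUID guidFromMd5(String input) {
        try {
            byte[] md5 = MessageDigest.getInstance("MD5")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            return fromDotNetBytes(md5);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(guidFromMd5("Hello World"));
    }
}
```

For "Hello World" this prints b18d0ab1-e064-4175-05b7-a99be72e3fe5, the Guid from the C# output in the question. From Scala the same class can be called directly, e.g. DotNetGuid.guidFromMd5(to_encode).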

Additional Background

In case you weren't aware, when dealing with C#'s GUIDs:

Note that the order of bytes in the returned byte array is different from the string representation of a Guid value. The order of the beginning four-byte group and the next two two-byte groups is reversed, whereas the order of the last two-byte group and the closing six-byte group is the same. The example provides an illustration.

And:

The order of hexadecimal strings returned by the ToString method depends on whether the computer architecture is little-endian or big-endian.

In your C#, rather than using:

Console.WriteLine("Guid: {0}", guid);

you may want to consider using:

Console.WriteLine(BitConverter.ToString(guid.ToByteArray()));

Your existing code calls ToString behind the scenes. Alas, ToString and ToByteArray do not return the bytes in the same order.
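
To see the ToString versus ToByteArray difference from the Java side, here is a rough sketch that goes the other way: it takes a UUID parsed from the C# string form and lays its bytes out the way Guid.ToByteArray() is described above. GuidBytes, toDotNetByteArray and toHex are hypothetical names; toHex only imitates the dash-separated format of BitConverter.ToString.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical helper class; names are illustrative, not an existing API.
final class GuidBytes {

    // Lays out a UUID's bytes in the order described for Guid.ToByteArray():
    // the first three fields are byte-reversed, the last 8 bytes are unchanged.
    static byte[] toDotNetByteArray(UUID uuid) {
        byte[] b = ByteBuffer.allocate(16)
                .putLong(uuid.getMostSignificantBits())
                .putLong(uuid.getLeastSignificantBits())
                .array();
        swap(b, 0, 3);
        swap(b, 1, 2); // 4-byte field reversed
        swap(b, 4, 5); // first 2-byte field reversed
        swap(b, 6, 7); // second 2-byte field reversed
        return b;
    }

    private static void swap(byte[] a, int i, int j) {
        byte t = a[i];
        a[i] = a[j];
        a[j] = t;
    }

    // Upper-case hex pairs separated by dashes.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append('-');
            }
            sb.append(String.format("%02X", bytes[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        UUID fromCSharp = UUID.fromString("b18d0ab1-e064-4175-05b7-a99be72e3fe5");
        System.out.println(toHex(toDotNetByteArray(fromCSharp)));
    }
}
```

Fed the Guid string from the question, this prints B1-0A-8D-B1-64-E0-75-41-05-B7-A9-9B-E7-2E-3F-E5, which is the original MD5 hash.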

mjwills
  • I should have added: the C# is a system I (generally) cannot modify. Thanks for helping out with this. If you have any advice on the second issue, I would be grateful. – Ari Krumbein Jul 26 '17 at 22:03
  • Thanks. Let me know if you figure out the version and variant bits problem. – Ari Krumbein Jul 26 '17 at 22:30
  • That is likely since in the C# and in the Scala you are generating the UUID in two different ways. Does https://gist.github.com/jeffjohnson9046/c663dd22bbe6bb0b3f5e help? – mjwills Jul 28 '17 at 10:21
  • FYI @mjwills, the caveat here is that because of the lack of version and variant bits, I'm pretty sure these aren't valid Java UUIDs, per se, which isn't relevant for me, but might be for others. Might want to include in your answer. – Ari Krumbein Jul 29 '17 at 00:29
  • What makes you believe they aren't valid Java UUIDs? Saying they aren't the exact same format as what Java would _generate_ is perhaps true. But that is different to saying they aren't _valid_. – mjwills Jul 29 '17 at 01:21