
The code cited is from this answer, but similar code is just about everywhere. Suppose we need to hash the contents of a C# string using the System.Security.Cryptography.HashAlgorithm.ComputeHash() overload that accepts a byte[]. The typical code goes like this:

using System.Security.Cryptography;
using System.Text;

public static byte[] GetHash(string inputString)
{
    HashAlgorithm algorithm = MD5.Create();  // or SHA1.Create()
    return algorithm.ComputeHash(Encoding.UTF8.GetBytes(inputString));
}
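
For illustration, a minimal usage sketch (assuming the GetHash method above with MD5 as the algorithm; not part of the cited code) that formats the digest as a hex string:

byte[] hash = GetHash("hello");  // hypothetical call to the helper above
string hex = BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
Console.WriteLine(hex);  // 5d41402abc4b2a76b9719d911017c592 (MD5 of UTF-8 "hello")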

C# strings are stored as UTF-16 internally.

Why is Encoding.UTF8 used instead of Encoding.Unicode?

sharptooth
    Sorry, but I can't see how an objective answer to this question could exist. – Jon Apr 02 '14 at 09:19
  • @Jon: Something like "that really makes no sense" or "if you use `Encoding.Unicode` these and these bad things happen". – sharptooth Apr 02 '14 at 09:20
  • None of the above. It's an arbitrary choice. The only way in which the actual choice matters is that *all* code that computes these hashes must use the same encoding otherwise they will obviously hash the same input to different values. – Jon Apr 02 '14 at 09:29
  • Based on [this](http://stackoverflow.com/a/10380166/578411) I would say that for the purpose of getting the hash the encoding is not needed and a waste of cpu cycles. – rene Apr 02 '14 at 10:24
  • Including a lot of zeros in the hash calculation makes it less secure. Utf-8 won't encode zeros like utf-16 does on text that uses characters from a Latin alphabet. – Hans Passant Apr 02 '14 at 11:47
  • See utf8everywhere.org. There's nothing more to add. – Pavel Radzivilovsky Apr 03 '14 at 16:56

1 Answer


Why is Encoding.UTF8 used instead of Encoding.Unicode?

Because, among application frameworks that have had to make this choice, UTF-8 is the encoding most commonly used for hashes. Outside the .NET world, UTF-16LE (which is what the misnamed “Unicode” encoding actually is) is not necessarily a natural choice for string storage, so if you use something other than UTF-8 you won't be able to interoperate with hashes generated by those other systems.

Crucially, UTF-8 is ASCII-compatible: for ASCII-only input it will generate the same hashes as all the software out there that works with encoding-ignorant byte strings. That includes a lot of PHP webapps, Java apps that call the naïve String.getBytes, and so on.
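
To make that concrete, here is a minimal sketch (not from the original answer) showing that for ASCII-only input the UTF-8 bytes are identical to the raw ASCII bytes, so the MD5 digest matches what encoding-ignorant tools such as md5sum produce for the same text:

using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

class AsciiCompatDemo
{
    static void Main()
    {
        string s = "hello";  // ASCII-only input

        // For ASCII-only text the UTF-8 and ASCII byte sequences are identical...
        byte[] utf8 = Encoding.UTF8.GetBytes(s);
        byte[] ascii = Encoding.ASCII.GetBytes(s);
        Console.WriteLine(utf8.SequenceEqual(ascii));  // True

        // ...so the MD5 digest matches the value produced by tools that hash
        // the plain byte string, e.g. `echo -n hello | md5sum`.
        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(utf8);
            Console.WriteLine(BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant());
            // 5d41402abc4b2a76b9719d911017c592
        }
    }
}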

So using UTF-8 means you get full interop with everything modern that uses UTF-8 and partial interop with pretty much everything else. Using UTF-16 would give you hashes that didn't match anyone else's.

You can still use Encoding.Unicode if you are sure you will only ever consume the hashes internally, but it doesn't really win you anything. Any savings from skipping the UTF-8 encoding step would likely be negated by having to hash a longer input sequence, because for the most-likely-to-occur ASCII characters UTF-8 is a much more compact representation than UTF-16.
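
A minimal sketch of that size difference (the sample string is just illustrative):

using System;
using System.Text;

class EncodingSizeDemo
{
    static void Main()
    {
        string s = "The quick brown fox";  // 19 ASCII characters

        // UTF-8 uses one byte per ASCII character; UTF-16 ("Unicode") uses two.
        Console.WriteLine(Encoding.UTF8.GetBytes(s).Length);     // 19
        Console.WriteLine(Encoding.Unicode.GetBytes(s).Length);  // 38
    }
}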

bobince