
We are having trouble getting a Unicode string to convert to a UTF-8 string to send over the wire:

// Start with our unicode string.
string unicode = "Convert: \u10A0";

// Get an array of bytes representing the unicode string, two for each character.
byte[] source = Encoding.Unicode.GetBytes(unicode);

// Convert the Unicode bytes to UTF-8 representation.
byte[] converted = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, source);

// Now that we have converted the bytes, save them to a new string.
string utf8 = Encoding.UTF8.GetString(converted);

// Send the converted string using a Microsoft function.
MicrosoftFunc(utf8);

Although we have converted the string to UTF-8, it's not arriving as UTF-8.
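In fact, a quick check of the code above shows the round trip is a no-op: decoding the converted bytes with GetString() just reproduces the original string, so nothing has actually changed by the time MicrosoftFunc() is called. A minimal sketch (standalone, with the same string as above):

```csharp
using System;
using System.Text;

class NoOpDemo
{
    static void Main()
    {
        // Same steps as in the question.
        string unicode = "Convert: \u10A0";
        byte[] source = Encoding.Unicode.GetBytes(unicode);
        byte[] converted = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, source);
        string utf8 = Encoding.UTF8.GetString(converted);

        // The "converted" string is identical to the one we started with:
        Console.WriteLine(utf8 == unicode); // True
    }
}
```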

Ryall

1 Answer


After a much troubled and confusing morning, we found the answer to this problem.

The key point we were missing, and what made this so confusing, is that .NET string objects are always stored internally as UTF-16 (two bytes per char). This means that when we call GetString() on the bytes, they are decoded straight back into a UTF-16 string behind the scenes, and we are no better off than when we started.
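A minimal sketch of this behaviour: no matter which encoding's bytes you hand to GetString(), what comes back is an ordinary UTF-16 string again, indistinguishable from the original.

```csharp
using System;
using System.Text;

class RoundTripDemo
{
    static void Main()
    {
        // U+10A0 is outside ASCII, so its UTF-8 form is multi-byte.
        string unicode = "Convert: \u10A0";

        // Encode to UTF-8 bytes...
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(unicode);

        // ...but GetString() decodes those bytes straight back into a
        // UTF-16 string: the "UTF-8-ness" is gone again.
        string roundTripped = Encoding.UTF8.GetString(utf8Bytes);

        Console.WriteLine(roundTripped == unicode); // True
    }
}
```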

When we started to get character errors and double-byte data at the other end, we knew something was wrong, but at a glance the code looked fine. Once we understood the point above, we realised that we needed to send the byte array itself if we wanted to preserve the encoding. Luckily, MicrosoftFunc() has an overload that takes a byte array instead of a string, so we could encode the Unicode string however we chose and send it off exactly as expected. The code changed to:

// Convert from a Unicode string to an array of bytes (encoded as UTF8).
byte[] source = Encoding.UTF8.GetBytes(unicode); 

// Send the encoded byte array directly! Do not send as a Unicode string.
MicrosoftFunc(source);

Summary:

In conclusion, from the above we can see that:

  • GetBytes(), among other things, effectively does an Encoding.Convert() from Unicode (because strings are always Unicode) to the encoding it was called on, and returns the encoded bytes as an array.
  • GetString(), among other things, effectively does an Encoding.Convert() from the encoding it was called on back to Unicode (because strings are always Unicode), and returns the result as a string object.
  • Convert() converts a byte array in one encoding to a byte array in another encoding. Obviously, strings cannot be used here (because strings are always Unicode).
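The first and third points above mean that the long way round in the original question (GetBytes() followed by Convert()) and the one-step GetBytes() call in the fix produce identical bytes. A small sketch confirming this, using the same string as the question:

```csharp
using System;
using System.Linq;
using System.Text;

class EquivalenceDemo
{
    static void Main()
    {
        string s = "Convert: \u10A0";

        // Long way: UTF-16 bytes first, then Convert() to UTF-8.
        byte[] viaConvert = Encoding.Convert(
            Encoding.Unicode, Encoding.UTF8, Encoding.Unicode.GetBytes(s));

        // Short way: GetBytes() does the same conversion in one step.
        byte[] direct = Encoding.UTF8.GetBytes(s);

        Console.WriteLine(viaConvert.SequenceEqual(direct)); // True
    }
}
```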
    There is some confusion here. There is no encoding called Unicode. Unicode is the name of a character set, which can be encoded in bytes using an encoding, for example UTF-8 or UTF-16. Thus `Encoding.Unicode` is severely misnamed, since it implements little-endian UTF-16 encoding. It should really have been called `Encoding.UTF16LE`. Strings are sequences of characters, and what encoding they're stored as in the underlying platform is irrelevant. It's an implementation detail that they happen to be stored as UTF-16. – Christoffer Hammarström Jun 23 '11 at 14:23
  • There is nothing wrong with calling it `Encoding.Unicode`, at some level Unicode is an encoding. The fact that a platform chooses to use UTF-16 or UTF-8 is just an implementation detail. When you use the string, it doesn't really matter what encoding it has internally. As long as the platform provides method to encode in an out, you don't necessarily even have to know what the internal encoding is at all. Some languages, python for example, don't say any encoding at all in the API, they just call it "a string" and you encode to and decode from that, that's an even cleaner approach. – thnee Aug 31 '16 at 11:26
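As the first comment notes, `Encoding.Unicode` is really little-endian UTF-16. A quick sketch confirming the byte order, contrasting it with `Encoding.BigEndianUnicode`:

```csharp
using System;
using System.Text;

class EndianDemo
{
    static void Main()
    {
        // 'A' is U+0041; little-endian UTF-16 stores the low byte first.
        byte[] le = Encoding.Unicode.GetBytes("A");          // 0x41, 0x00
        byte[] be = Encoding.BigEndianUnicode.GetBytes("A"); // 0x00, 0x41

        Console.WriteLine(BitConverter.ToString(le)); // 41-00
        Console.WriteLine(BitConverter.ToString(be)); // 00-41
    }
}
```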