23

I have an UTF-8 string and I need to get the byte array of UTF-16 encoding, so how can I convert my string to UTF-16 byte array?

Update:
To clarify: we have Encoding.Unicode.GetBytes() and Encoding.UTF8.GetBytes() to get the byte array of a string, but what about UTF-16? There is no Encoding.UTF16.GetBytes(), so how can I get the byte array?

Afshin Mehrabani
  • you probably either have a string or an UTF-8 byte array. String is a type that contains characters, regardless of encoding, as encoding is only for byte array representation – njzk2 Sep 09 '13 at 12:14
  • also, what have you tried, and please post your code – njzk2 Sep 09 '13 at 12:15
  • What do you mean by "I have an UTF-8 string" to start with? If you have an instance of System.String, it will be UTF-16 in memory already. – Jon Skeet Sep 09 '13 at 12:15
  • A string in C# is always `UTF-16`, so I believe there is nothing to _convert_. What does your `UTF-8` string look like? – Soner Gönül Sep 09 '13 at 12:15
  • @njzk2: Not quite: strings are sequences of UTF-16 code units. That's important in the case of non-BMP characters. – Jon Skeet Sep 09 '13 at 12:15
  • http://stackoverflow.com/questions/472906/net-string-to-byte-array-c-sharp "Internally, the .NET framework uses UTF16 to represent strings, so if you simply want to get the exact bytes that .NET uses, use System.Text.Encoding.Unicode.GetBytes". Regardless, that answer contains the code to get the UTF-16 representation. – Jeroen Vannevel Sep 09 '13 at 12:16
  • @JonSkeet : the internal array that backs the string is indeed in utf16, but the string itself doesn't have a notion of encoding. The encoding will be relevant only when converting to/from a byte array – njzk2 Sep 09 '13 at 12:17
  • @AfshinMehrabani: from the Encoding MSDN: `Unicode Gets an encoding for the UTF-16 format using the little endian byte order.` http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx – Jeroen Vannevel Sep 09 '13 at 12:22

2 Answers

35

For little-endian UTF-16, use Encoding.Unicode.

For big-endian UTF-16, use Encoding.BigEndianUnicode.

Alternatively, construct an explicit instance of UnicodeEncoding which allows you to specify the endianness, whether or not to include byte-order marks, and whether to throw an exception on invalid data.
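
As a rough sketch of the three options above: the class and method names here (Utf16Sketch, ToUtf16LE, ToUtf16BE, CreateStrict) are made up for illustration, and only Encoding.Unicode, Encoding.BigEndianUnicode and the UnicodeEncoding constructor come from the framework.

```csharp
using System;
using System.Text;

static class Utf16Sketch
{
    // Little-endian UTF-16 (what .NET calls "Unicode").
    public static byte[] ToUtf16LE(string s) => Encoding.Unicode.GetBytes(s);

    // Big-endian UTF-16.
    public static byte[] ToUtf16BE(string s) => Encoding.BigEndianUnicode.GetBytes(s);

    // Explicit instance: caller picks the endianness; the encoding advertises
    // a byte-order mark and throws on invalid data (e.g. a lone surrogate).
    public static UnicodeEncoding CreateStrict(bool bigEndian) =>
        new UnicodeEncoding(bigEndian, byteOrderMark: true, throwOnInvalidBytes: true);

    static void Main()
    {
        Console.WriteLine(BitConverter.ToString(ToUtf16LE("A"))); // 41-00
        Console.WriteLine(BitConverter.ToString(ToUtf16BE("A"))); // 00-41

        UnicodeEncoding strict = CreateStrict(bigEndian: true);
        // GetBytes never writes the BOM itself; GetPreamble returns it separately.
        Console.WriteLine(BitConverter.ToString(strict.GetPreamble())); // FE-FF
        Console.WriteLine(BitConverter.ToString(strict.GetBytes("A"))); // 00-41
    }
}
```

Note that GetBytes returns only the encoded text. If the consumer expects a byte-order mark, prepend the result of GetPreamble() yourself.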

Jon Skeet
11

I have an UTF-8 string and ...

No, you don't. That's not possible. You may have a sequence (array or stream) of bytes that holds UTF-8 encoded text. But not a string.

A .NET string always contains Unicode (or more precisely, UTF-16).

..., so how can I convert my string to UTF-16 byte array?

string myText = ...;  // some string, maybe from a UTF-8 file or any other source
byte[] utf16Data = Encoding.Unicode.GetBytes(myText);  // little-endian UTF-16 bytes

The Encoding class defines the properties UTF7, UTF8, Unicode and UTF32, among others. In the context of the .NET Framework, Unicode means UTF-16 (little-endian).
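
If what you actually hold is a byte array of UTF-8 encoded text, a minimal sketch of getting the UTF-16 bytes could look like this. The class and method names (Utf8ToUtf16Sketch, ViaString, ViaConvert) are hypothetical, and the sample input is an assumption chosen for illustration.

```csharp
using System;
using System.Text;

static class Utf8ToUtf16Sketch
{
    // Decode the UTF-8 bytes into a string, then encode that string as UTF-16 LE.
    public static byte[] ViaString(byte[] utf8Bytes)
    {
        string text = Encoding.UTF8.GetString(utf8Bytes);
        return Encoding.Unicode.GetBytes(text);
    }

    // Same conversion in one call, without handling the intermediate string yourself.
    public static byte[] ViaConvert(byte[] utf8Bytes) =>
        Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);

    static void Main()
    {
        byte[] utf8 = { 0x68, 0x69 }; // "hi" encoded as UTF-8 (sample input)
        Console.WriteLine(BitConverter.ToString(ViaString(utf8)));  // 68-00-69-00
        Console.WriteLine(BitConverter.ToString(ViaConvert(utf8))); // 68-00-69-00
    }
}
```

Both routes should give the same bytes. Going through a string is handy when you also need the text itself, while Encoding.Convert is shorter when you only need the bytes.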

H H
  • Unicode is UTF-16 ... yes, in the dialect of .NET. For the rest of the world, Unicode is an enumeration of characters (codepoints), and UTF-16 is one implementation of this enumeration on at least 2 bytes – njzk2 Sep 09 '13 at 12:26