
I have read a question about UTF-8, UTF-16 and UCS-2, and almost all the answers state that UCS-2 is obsolete and that C# uses UTF-16.

However, all my attempts to create the 4-byte character U+1D11E in C# have failed, so I actually think C# only uses the UCS-2 subset of UTF-16.

Here are my attempts:

string s = "\u1D11E"; // gives the 2 character string "ᴑE", because \u1D11 is ᴑ
string s = (char) 0x1D11E; // won't compile because of an overflow
string s = Encoding.Unicode.GetString(new byte[] {0xD8, 0x34, 0xDD, 0x1E}); // gives 㓘ờ

Are C# strings really UTF-16 or are they actually UCS-2? If they are UTF-16, how would I get the violin clef into my C# string?

Thomas Weller
  • The simplest thing is to just include the character in the source code, that is `string s = "𝄞";`. I suggest you save your `.cs` file with UTF-8 encoding. This character in the *Supplementary Multilingual Plane* will take up four octets in UTF-8. When held in memory it will take up two UTF-16 code units, or `char` values, a so-called surrogate pair (see the sketch after these comments). – Jeppe Stig Nielsen Jan 02 '14 at 00:10
  • Yes, I read about that on Wikipedia and that's why I tried the Encoding.GetString() method. – Thomas Weller Jan 02 '14 at 00:11
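
A minimal sketch of the surrogate pair the comment describes, using the framework's `char.ConvertFromUtf32`:

  string s = char.ConvertFromUtf32(0x1D11E);     // "𝄞"
  Console.WriteLine(s.Length);                   // 2 (two UTF-16 code units)
  Console.WriteLine(((int)s[0]).ToString("X4")); // D834 (high surrogate)
  Console.WriteLine(((int)s[1]).ToString("X4")); // DD1E (low surrogate)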

3 Answers


Use capital U instead:

  string s = "\U0001D11E";

And you overlooked that most machines are little-endian:

  string t = Encoding.Unicode.GetString(new byte[] { 0x34, 0xD8, 0x1E, 0xDD });
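
Alternatively, the original big-endian byte order decodes correctly with `Encoding.BigEndianUnicode`:

  string u = Encoding.BigEndianUnicode.GetString(new byte[] { 0xD8, 0x34, 0xDD, 0x1E }); // "𝄞"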
Hans Passant
  • I absolutely like that you found my bug in the byte-by-byte encoding. Although other answers also found the capital U solution, this is the reason why I accept your answer. – Thomas Weller Jan 02 '14 at 00:07
  • But unless your `.cs` source file is saved in some 1-byte "ANSI" codepage, you should consider simply doing `string s = "𝄞";`. That is pretty natural. – Jeppe Stig Nielsen Jan 02 '14 at 00:13

C# definitely uses UTF-16. The correct way to define characters above the U+0000 - U+FFFF range is with the `\U` escape sequence, which takes 8 hexadecimal digits:

string s = "\U0001D11E";

If you use \u1D11E it's interpreted as the U+1D11 character followed by an E.

One thing to keep in mind when using these characters is that the String.Length property and most string methods work on UTF-16 code units, not Unicode characters. From the MSDN documentation:

The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.
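
For example, a small sketch contrasting the two (assuming `using System.Globalization;` at the top of the file):

  string s = "\U0001D11E";
  Console.WriteLine(s.Length);                               // 2 (UTF-16 code units)
  Console.WriteLine(new StringInfo(s).LengthInTextElements); // 1 (Unicode character)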

Joni

According to the C# specification, characters that need more than 4 hex digits are written using `\U` (uppercase U) followed by 8 hexadecimal digits. Once encoded correctly in the string, the character can be exported correctly using any Unicode encoding:

string s = "\U0001D11E";

foreach (var b in Encoding.UTF32.GetBytes(s))
    Console.WriteLine(b.ToString("x2"));

Console.WriteLine();

foreach (var b in Encoding.Unicode.GetBytes(s))
    Console.WriteLine(b.ToString("x2"));

> 1e
> d1
> 01
> 00
>
> 34
> d8
> 1e
> dd
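
The same character takes four octets in UTF-8 as well, which is easy to verify with the same approach:

  foreach (var b in Encoding.UTF8.GetBytes(s))
      Console.WriteLine(b.ToString("x2")); // f0 9d 84 9e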
Joachim Isaksson