
I am trying to determine the implications of character encoding for a software system I am planning, and I found something odd while doing a test.

To my knowledge, C# internally uses UTF-16, which (to my knowledge) encompasses every Unicode code point using one or two 16-bit code units. So I wanted to make some character literals, and I intentionally chose one character from the SMP (supplementary) plane and one (얤) from the BMP. The results are:

char ch1 = '얤'; // No problem
char ch2 = ''; // Compilation error "Too many characters in character literal"

What's going on?

A corollary of this question: if I have the string "얤얤", it is displayed correctly in a MessageBox; however, when I convert it to a char[] using ToCharArray I get an array with four elements rather than three, and String.Length is also reported as four rather than three.

Am I missing something here?

Peter Mortensen
    It is probably saving the complex character as Unicode Code Point which is typically two chars in length. – Jesan Fafon May 10 '13 at 16:03
  • @RaymondChen sharp eye... I tried a few different searches and found nothing relevant! –  May 10 '13 at 16:44
  • Can I somehow close this question as a duplicate then? –  May 10 '13 at 16:50

2 Answers


MSDN says that the char type can represent a single Unicode 16-bit character (and thus only characters from the BMP).

If you use a character outside the BMP (encoded in UTF-16 as a surrogate pair, 2×16 bits), the compiler treats it as two characters.
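A minimal sketch illustrating this (using U+1D6C3, the code point the question's comments settle on; char.ConvertFromUtf32 is used here so the source file's encoding doesn't matter):

```csharp
using System;

class SurrogateDemo
{
    static void Main()
    {
        // U+1D6C3 lies above U+FFFF, so UTF-16 must encode it
        // as a surrogate pair of two char values.
        string s = char.ConvertFromUtf32(0x1D6C3);

        Console.WriteLine(s.Length);                   // 2, not 1
        Console.WriteLine(char.IsHighSurrogate(s[0])); // True
        Console.WriteLine(char.IsLowSurrogate(s[1]));  // True
        Console.WriteLine($"{(int)s[0]:X4} {(int)s[1]:X4}"); // D835 DEC3
    }
}
```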

el vis

Your source file may not be saved in UTF-8 (which is recommended when using special characters in source code), so the compiler may actually see a sequence of bytes that confuses it. You can verify that by opening your source file in a hex editor: the byte(s) you see in place of your character will likely not be a valid UTF-8 encoding of it.

If it's not already on, you can turn on that setting under Tools -> Options -> Documents in Visual Studio (I use 2008); the option is "Save documents as Unicode when data cannot be saved in codepage".

Typically, it's better to specify special characters using an escape sequence.

This MSDN article describes how to use \uxxxx sequences to specify the Unicode character code you want. This blog entry lists all the C# escape sequences; the reason I'm including it is that it mentions \xnnn. Avoid that format: it's a variable-length version of \u, and it can cause issues in some situations (not in yours, though).
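For example (a minimal sketch; '가' U+AC00 stands in for any BMP character):

```csharp
char bmp = '\uAC00';        // '가' (U+AC00) — a BMP character fits in one char
// char bad = '\U0001D6C3'; // won't compile: the escape expands to a surrogate pair,
                            // which is two chars, too many for a char literal
string smp = "\U0001D6C3";  // fine in a string literal: stored as the
                            // surrogate pair \uD835\uDEC3 (Length == 2)
```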

The MSDN article also points out why the character assignment doesn't compile: the code point of the character in question is greater than U+FFFF, which is outside the range of the char type.

As for the string part of the question, the answer is that the SMP character is represented as two char values. This SO question includes some code showing how to get the code points out of a string; it involves the use of StringInfo.GetTextElementEnumerator.
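A short sketch of that approach (the string here uses '가' U+AC00 as a stand-in BMP character around the SMP code point; GetTextElementEnumerator returns a surrogate pair as a single text element):

```csharp
using System;
using System.Globalization;

class CodePoints
{
    static void Main()
    {
        // Three code points, but Length counts UTF-16 code units: 4.
        string s = "\uAC00\U0001D6C3\uAC00";
        Console.WriteLine(s.Length); // 4

        // Walk the string one text element at a time; the surrogate
        // pair comes back as one element, decoded with ConvertToUtf32.
        var e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
        {
            string element = (string)e.Current;
            Console.WriteLine($"U+{char.ConvertToUtf32(element, 0):X}");
        }
        // Prints U+AC00, U+1D6C3, U+AC00 — three code points.
    }
}
```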

xxbbcc
  • The answer was in fact in the MSDN article you quoted... the code point for is 1D6C03 and the article states that code points above 10FFFF are not supported. Thanks! –  May 10 '13 at 16:06
  • Strictly my previous comment is incorrect, the MSDN article states that "a Unicode character in the range U+10000 to U+10FFFF is not permitted in a character literal and is represented using a Unicode surrogate pair in a string literal", but then it goes on to say that in strings Unicode characters beyond 10FFFF are not supported. Just to clarify... the char literal is invalid because the code point is above FFFF, the string literal is invalid because the code point is above 10FFFF –  May 10 '13 at 16:13
  • What I don't understand is why my string with the >10FFFF code point displays correctly in a MessageBox... –  May 10 '13 at 16:15
  • @Paul this link may answer that (I didn't know this limitation before): http://stackoverflow.com/questions/8369772/got-compile-error-when-using-u-escape – xxbbcc May 10 '13 at 16:24
  • Actually the answer as of now is that I suck... I mis-created the code point, it isn't 1D6C03 it's 1D6C3... I went and preceded my 3 with a 0 out of stupidity. So 1D6C3 < 10FFFF obviously... which means my assessment in my first 2 comments is incorrect. The only part of my initial assessment that remains is that 1D6C3 > FFFF so it's not good in a char literal and must be in a string. –  May 10 '13 at 16:30
  • @Paul lol, don't worry about that - happens to me all the time. :) – xxbbcc May 10 '13 at 16:31
  • OK so I took a clue from http://stackoverflow.com/questions/687359/how-would-you-get-an-array-of-unicode-code-points-a-net-string and successfully pulled the correct code points out of the string with the bad character. I think the trick lies somewhere in the String.Length and String.ToCharArray members, perhaps they don't deal with surrogate pairs correctly and that's why the StringInfo.GetTextElementEnumerator was created... –  May 10 '13 at 16:38
  • @Paul String.Length actually counts surrogate pairs as 2 characters - this is a limitation of UTF-16 - a surrogate pair uses up 2 character slots. The pair, however, is not the actual Unicode code of the character it represents - it's a surrogate marker and a code. – xxbbcc May 10 '13 at 16:40
  • You're right... I just found that in http://stackoverflow.com/questions/5656472/what-does-the-net-string-length-property-return-surrogate-neutral-length-or-co –  May 10 '13 at 16:41
  • @Paul Is there a specific reason why you un-accepted my answer? – xxbbcc May 10 '13 at 16:48
  • Yes... originally I thought your answer was "correct" because the MSDN article linked to talks about limits on the code point values in chars and strings; the char part is good but the string part is really answered by your second-to-last comment and my comment after it. –  May 10 '13 at 16:58
  • I guess I can add that link to your answer and re-accept it. –  May 10 '13 at 16:59