I am trying to determine the implications of character encoding for a software system I am planning, and I found something odd while doing a test.
To my knowledge, C# internally uses UTF-16, which I understood to cover every Unicode code point using two 16-bit fields. So I wanted to make some character literals and intentionally chose one character from the SMP plane and 얤 from the BMP plane. The results are:
char ch1 = '얤'; // No problem
char ch2 = ''; // the SMP character (it does not render here): compilation error "Too many characters in character literal"
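For reference, a quick sanity check I ran on the BMP character alone behaves as I expected (just a sketch; the numeric value is simply the character's code point):

char bmp = '얤';                                  // U+C564, a BMP code point
System.Console.WriteLine((int)bmp);               // 50532 (0xC564), fits in one 16-bit char
System.Console.WriteLine(char.IsSurrogate(bmp));  // False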
What's going on?
A corollary of this question: if I have the string "얤얤" (with the SMP character between the two 얤 characters, though it does not render here), it is displayed correctly in a MessageBox; however, when I convert it to a char[] using ToCharArray, I get an array with four elements rather than three, and String.Length is also reported as four rather than three.
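Here is a minimal sketch of how I reproduce those counts, using U+20001 (built with char.ConvertFromUtf32) purely as a stand-in for the SMP character, since the actual character I used does not survive in this post:

// Build the three-code-point string; U+20001 stands in for the SMP character.
string s = "얤" + char.ConvertFromUtf32(0x20001) + "얤";

System.Console.WriteLine(s.Length);                // 4, not 3
System.Console.WriteLine(s.ToCharArray().Length);  // also 4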
Am I missing something here?