In Unicode, you have code points. These range from U+0000 to U+10FFFF, so they need up to 21 bits. Your character, 𝐀 (Mathematical Bold Capital A), has a code point of U+1D400.
In Unicode encodings, you have code units. These are the natural unit of the encoding: 8-bit for UTF-8, 16-bit for UTF-16, and so on. One or more code units encode a single code point.
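To make the code unit idea concrete, here is a minimal sketch that counts how many code units a string needs in each encoding form. The class and method names (CodeUnitCounts, Count) are made up for illustration; the Encoding calls are the standard System.Text ones.

```csharp
using System;
using System.Text;

public static class CodeUnitCounts
{
    // Hypothetical helper: number of code units needed to encode `text`
    // in UTF-8, UTF-16 and UTF-32 respectively.
    public static (int Utf8, int Utf16, int Utf32) Count(string text) =>
        (Encoding.UTF8.GetByteCount(text),         // UTF-8 code units are 1 byte each
         Encoding.Unicode.GetByteCount(text) / 2,  // UTF-16 code units are 2 bytes each
         Encoding.UTF32.GetByteCount(text) / 4);   // UTF-32 code units are 4 bytes each
}
```

For "A" this gives (1, 1, 1); for U+1D400 it gives (4, 2, 1), i.e. four UTF-8 code units, two UTF-16 code units, and one UTF-32 code unit for the same single code point.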
In UTF-16, two code units that form a single code point are called a surrogate pair. Surrogate pairs are used to encode any code point that doesn't fit in 16 bits, i.e. U+10000 and up.
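The arithmetic behind a surrogate pair is short enough to sketch by hand. The helper below is illustrative only (SurrogateMath and ToSurrogatePair are names invented here); in real code you would normally let char.ConvertFromUtf32 do this for you.

```csharp
using System;

public static class SurrogateMath
{
    // Hypothetical helper: splits a supplementary code point (U+10000..U+10FFFF)
    // into its UTF-16 high and low surrogate code units.
    public static (char High, char Low) ToSurrogatePair(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int offset = codePoint - 0x10000;              // leaves a 20-bit value
        char high = (char)(0xD800 + (offset >> 10));   // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return (high, low);
    }
}
```

For U+1D400 the offset is 0xD400, so the high surrogate is 0xD800 + 0x35 = 0xD835 and the low surrogate is 0xDC00 + 0 = 0xDC00.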
This gets a little tricky in .NET, as a .NET Char represents a single UTF-16 code unit, and a .NET String is a sequence of code units.
So your code point (U+1D400) can't fit in 16 bits and needs a surrogate pair, meaning your string has two code units in it:
var highUnicodeChar = "𝐀"; // U+1D400; can also be written as "\U0001D400"
char a = highUnicodeChar[0]; // code unit 0xD835
char b = highUnicodeChar[1]; // code unit 0xDC00
So when you index into the string like that, you're only getting half of the surrogate pair.
You can use IsSurrogatePair to test for a surrogate pair. For instance:
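// Returns the whole code point starting at idx: two code units for a surrogate pair, otherwise one.
string GetFullCodePointAtIndex(string s, int idx) =>
    s.Substring(idx, char.IsSurrogatePair(s, idx) ? 2 : 1);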
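The same check extends naturally to walking a whole string one code point at a time. This is a rough sketch, with CodePoints.Enumerate being a name invented for the example; it passes unpaired surrogates through unchanged rather than throwing, which is an assumption you may want to revisit.

```csharp
using System;
using System.Collections.Generic;

public static class CodePoints
{
    // Hypothetical helper: yields each code point in `s`,
    // stepping over both code units of a surrogate pair.
    public static IEnumerable<int> Enumerate(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i))
            {
                yield return char.ConvertToUtf32(s, i); // combine high + low surrogate
                i++;                                    // skip the low surrogate
            }
            else
            {
                yield return s[i]; // BMP code unit, or an unpaired surrogate passed through as-is
            }
        }
    }
}
```

For "a\U0001D400b" this yields 0x61, 0x1D400, 0x62. On .NET Core 3.0 and later, string.EnumerateRunes() covers similar ground with the Rune type.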
It's important to note that the rabbit hole of variable-length encoding in Unicode doesn't end at the code point. A grapheme cluster is the "visible thing" that most people, when asked, would call a "character". A grapheme cluster is made from one or more code points: a base character, and zero or more combining characters. An example of a combining character is an umlaut, or any of the various other decorations/modifiers you might want to add. See this answer for a horrifying example of what combining characters can do.
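.NET calls grapheme clusters "text elements" and exposes them through System.Globalization.StringInfo. A minimal sketch of splitting a string that way follows; Graphemes.Split is a name made up for this example, and how closely the results match the Unicode grapheme cluster rules depends on the runtime version (as far as I know, .NET 5 and later follow them more fully than older runtimes).

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class Graphemes
{
    // Hypothetical helper: splits `s` into text elements,
    // .NET's term for grapheme clusters.
    public static List<string> Split(string s)
    {
        var result = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
            result.Add(e.GetTextElement());
        return result;
    }
}
```

With "e\u0301" (an e followed by a combining acute accent) this returns a single element that is two code units long, even though string.Length reports 2.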
To test for a combining character, you can use GetUnicodeCategory to check for an enclosing mark, non-spacing mark, or spacing combining mark (UnicodeCategory.EnclosingMark, NonSpacingMark, or SpacingCombiningMark).
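As a sketch of that check, the helper below uses CharUnicodeInfo.GetUnicodeCategory(string, int), which also handles a surrogate pair starting at the given index. The names CombiningMarks and IsCombiningMark are invented for the example.

```csharp
using System;
using System.Globalization;

public static class CombiningMarks
{
    // Hypothetical helper: true if the code point starting at s[index]
    // falls in one of the three Unicode mark categories.
    public static bool IsCombiningMark(string s, int index)
    {
        UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(s, index);
        return category == UnicodeCategory.NonSpacingMark        // Mn
            || category == UnicodeCategory.SpacingCombiningMark  // Mc
            || category == UnicodeCategory.EnclosingMark;        // Me
    }
}
```

For "e\u0301", index 0 (the base letter) gives false and index 1 (the combining acute accent) gives true.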