Okay, in .NET and C# all strings are encoded as UTF-16LE. A string is stored as a sequence of chars. Each char encapsulates 2 bytes, or 16 bits, of storage.
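As a quick sanity check of the storage claim, here is a minimal sketch (top-level statements, so it assumes C# 9 or later; the variable name s is just for illustration). It shows that Length counts chars, not bytes, and that the UTF-16LE byte count is twice the char count.

```csharp
using System;
using System.Text;

string s = "abc";

// A char is a 16-bit UTF-16 code unit.
Console.WriteLine(sizeof(char));                      // 2

// Length counts chars (code units), not bytes and not Text Elements.
Console.WriteLine(s.Length);                          // 3

// Encoding.Unicode is UTF-16LE, so each char here takes 2 bytes.
Console.WriteLine(Encoding.Unicode.GetByteCount(s));  // 6
```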
What we see "on paper or screen" as a single letter, character, glyph, symbol, or punctuation mark can be thought of as a single Text Element. As described in Unicode Standard Annex #29, UNICODE TEXT SEGMENTATION, each Text Element is represented by one or more Code Points. An exhaustive list of Code Points can be found in the Unicode code charts.
Each Code Point needs to be encoded into binary for internal representation by a computer. As stated, each char stores 2 bytes. Code Points at or below U+FFFF can be stored in a single char. Code Points above U+FFFF are stored as a Surrogate Pair, using two chars to represent a single Code Point.
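To make the U+FFFF boundary concrete, here is a small sketch using char.ConvertFromUtf32 and char.ConvertToUtf32. The two sample Code Points (U+00E9 and U+20213) and the variable names are my own choices for illustration.

```csharp
using System;

// U+00E9 is at or below U+FFFF, so it fits in one char.
string bmp = char.ConvertFromUtf32(0x00E9);
// U+20213 is above U+FFFF, so it needs a Surrogate Pair.
string astral = char.ConvertFromUtf32(0x20213);

Console.WriteLine(bmp.Length);     // 1
Console.WriteLine(astral.Length);  // 2

// The two chars of the pair are a high surrogate followed by a low surrogate.
Console.WriteLine(char.IsHighSurrogate(astral[0]));  // True
Console.WriteLine(char.IsLowSurrogate(astral[1]));   // True
Console.WriteLine("{0:X4} {1:X4}", (int)astral[0], (int)astral[1]);  // D840 DE13

// Decoding the pair gives back the original Code Point.
Console.WriteLine("U+{0:X}", char.ConvertToUtf32(astral, 0));  // U+20213
```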
Given what we now know, we can deduce that a Text Element can be stored as one char, as a Surrogate Pair of two chars or, if the Text Element is represented by multiple Code Points, as some combination of single chars and Surrogate Pairs. As if that weren't complicated enough, some Text Elements can be represented by different combinations of Code Points, as described in Unicode Standard Annex #15, UNICODE NORMALIZATION FORMS.
Interlude
So, strings that look the same when rendered can actually be made up of different combinations of chars. An ordinal (byte by byte) comparison of two such strings would detect a difference, which may be unexpected or undesirable.
You can re-encode .NET strings so that they use the same Normalization Form. Once normalized, two strings with the same Text Elements will be encoded the same way. To do this, use the string.Normalize method. However, remember, some different Text Elements look similar to each other. :-s
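Here is a rough sketch of the normalization point, using "é" as an example of my own choosing: it can be written as the single Code Point U+00E9 or as "e" followed by the combining acute accent U+0301. I'm assuming a runtime with normalization support available, which is the usual case.

```csharp
using System;
using System.Text;

string composed = "\u00E9";     // one Code Point
string decomposed = "e\u0301";  // two Code Points, rendered the same

// The two strings differ char by char, so an ordinal comparison fails.
Console.WriteLine(composed.Length);    // 1
Console.WriteLine(decomposed.Length);  // 2
Console.WriteLine(string.Equals(composed, decomposed, StringComparison.Ordinal));  // False

// After normalizing to the same form (NFC here), they compare equal.
string normalized = decomposed.Normalize(NormalizationForm.FormC);
Console.WriteLine(string.Equals(composed, normalized, StringComparison.Ordinal));  // True
```

Which Normalization Form to pick depends on what you're doing; the important part is that both strings go through the same one before you compare them.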
So, what does this all mean in relation to the question? The Text Element '𠈓' is represented by the single Code Point U+20213, in the CJK Unified Ideographs Extension B block. This means it cannot be encoded as a single char and must be encoded as a Surrogate Pair, using two chars. This is why string b is one char longer than string a.
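You can see the difference by dumping the individual chars. This is just a sketch; I've written U+20213 with the \U escape so the literal doesn't depend on how the source file is encoded.

```csharp
using System;

string a = "abc";
string b = "A\U00020213C";  // 'A', U+20213, 'C'

Console.WriteLine(a.Length);  // 3
Console.WriteLine(b.Length);  // 4

// Prints 0041, D840, DE13, 0043: the middle two chars are the Surrogate Pair.
foreach (char c in b)
{
    Console.WriteLine("{0:X4}", (int)c);
}
```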
If you need to reliably (see caveat) count the number of Text Elements in a string, you should use the System.Globalization.StringInfo class, like this:
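using System;
using System.Globalization;
string a = "abc";
// 'A', U+20213 (stored as a Surrogate Pair), 'C'
string b = "A\U00020213C";
// LengthInTextElements counts Text Elements rather than chars.
Console.WriteLine("Length a = {0}", new StringInfo(a).LengthInTextElements);
Console.WriteLine("Length b = {0}", new StringInfo(b).LengthInTextElements);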
giving the output,
Length a = 3
Length b = 3
as expected.
Caveat
The .NET implementation of Unicode Text Segmentation in the StringInfo and TextElementEnumerator classes should be generally useful and, in most cases, will yield a response that the caller expects. However, as stated in Unicode Standard Annex #29, "The goal of matching user perceptions cannot always be met exactly because the text alone does not always contain enough information to unambiguously decide boundaries."
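If you want to look at the individual Text Elements rather than just count them, something along these lines should work. It's a sketch using StringInfo.GetTextElementEnumerator; the sample string is my own and deliberately sticks to simple cases (a Surrogate Pair and a base letter plus combining mark). More complex sequences may be segmented differently depending on the .NET version.

```csharp
using System;
using System.Globalization;

// 'A', U+20213 (Surrogate Pair), 'C', then 'e' + combining acute accent.
string s = "A\U00020213Ce\u0301";

Console.WriteLine(s.Length);                               // 6 chars
Console.WriteLine(new StringInfo(s).LengthInTextElements); // 4 Text Elements

// Walk the Text Elements, printing where each starts and how many chars it uses.
TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
while (e.MoveNext())
{
    string element = e.GetTextElement();
    Console.WriteLine("index {0}: {1} char(s)", e.ElementIndex, element.Length);
}
```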