
This code:

string a = "abc";
string b = "A𠈓C";
Console.WriteLine("Length a = {0}", a.Length);
Console.WriteLine("Length b = {0}", b.Length);

outputs:

Length a = 3
Length b = 4

Why? The only thing I could imagine is that the Chinese character is 2 bytes long and that the .Length method returns the byte count.

Yuval Itzchakov
weini37
    How did I know it was a surrogate pair problem just from looking at the title. Ah, good ol' System.Globalization is your ally! – Chris Cirefice Nov 17 '14 at 15:54
  • it's 4 bytes long in UTF-16, not 2 – phuclv Nov 18 '14 at 04:27
  • the decimal value of the char `𠈓` is 131603, and as chars are unsigned bytes, that means you can achieve that value in 2 characters rather than 4 (unsigned 16 bit value max is 65535 (or 65536 variations) and using 2 chars to represent it allows for a maximum number of variations of not 65536*2 (131072) but rather 65536*65536 variations (4,294,967,296, effectively a 32 bit value) – GMasucci Nov 18 '14 at 12:07
  • @GMasucci: It's 2 characters in UTF-16, but 4 bytes, because a UTF-16 character is 2 bytes in size, otherwise it could not store 65536 variations, but only 256. – Kaiserludi Nov 18 '14 at 18:04
  • @GMasucci you cannot store 4,294,967,296 different codepoints with UTF-16 as some bits are used to denote surrogate pair – phuclv Nov 19 '14 at 04:11
  • Indeed, surrogate pairs are just enough to store 20 bits of payload, meaning 16*65536 possible codepoints (out of 17*65536 codepoints defined in all of Unicode) – Medinoc Nov 19 '14 at 10:07
  • I stand happily corrected:) I was trying to point out the potential possible combinations not the actually available ones though, but still I should have had a clearer comment. Cheers guys:) – GMasucci Nov 19 '14 at 12:02
  • As a detail, I would say these strings are probably encoded in the user's preferred multi-byte code page, not UTF-8 as everyone seems to assume. – Khouri Giordano Nov 19 '14 at 15:40
  • Interesting, I get the same result in JavaScript: `"A𠈓C".length // 4` – Salman A Nov 19 '14 at 17:13
  • @KhouriGiordano: no, the C# "string" type uses UTF-16. – Harry Johnston Nov 20 '14 at 00:53
  • I recommend reading the great article 'The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)' http://www.joelonsoftware.com/articles/Unicode.html – ItsMe Nov 20 '14 at 13:32

8 Answers


Everyone else is giving the surface answer, but there's a deeper rationale too: the number of "characters" is a difficult-to-define question and can be surprisingly expensive to compute, whereas a length property should be fast.

Why is it difficult to define? Well, there are a few options, and none is really more valid than another:

  • The number of code units (bytes or other fixed size data chunk; C# and Windows typically use UTF-16 so it returns the number of two-byte pieces) is certainly relevant, as the computer still needs to deal with the data in that form for many purposes (writing to a file, for example, cares about bytes rather than characters)

  • The number of Unicode codepoints is fairly easy to compute (although O(n), because you have to scan the string for surrogate pairs) and might matter to a text editor... but isn't actually the same thing as the number of characters printed on screen (called graphemes). For example, some accented letters can be represented in two forms: a single codepoint, or two points paired together, one representing the letter, and one saying "add an accent to my partner letter". Would the pair be two characters or one? You can normalize strings to help with this, but not all valid letters have a single codepoint representation.

  • Even the number of graphemes isn't the same as the length of a printed string, which depends on the font among other factors, and since some characters are printed with some overlap in many fonts (kerning), the length of a string on screen is not necessarily equal to the sum of the length of graphemes anyway!

  • Some Unicode points aren't even characters in the traditional sense, but rather some kind of control marker. Like a byte order marker or a right-to-left indicator. Do these count?

In short, the length of a string is actually a ridiculously complex question and calculating it can take a lot of CPU time as well as data tables.
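
Java strings share this UTF-16 representation (a comment on another answer notes it prints 4 as well), so the first two of the counts above can be sketched there for comparison; `length()` plays the role of C#'s `Length`:

```java
public class Counts {
    public static void main(String[] args) {
        String s = "a𠈓c"; // U+20213 takes two UTF-16 code units

        // Code units: what Java's length() and C#'s Length report
        System.out.println(s.length());                      // 4

        // Code points: an O(n) scan that joins surrogate pairs
        System.out.println(s.codePointCount(0, s.length())); // 3
    }
}
```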

Moreover, what's the point? Why do these metrics matter? Well, only you can answer that for your case, but personally, I find they are generally irrelevant. Limiting data entry I find is more logically done by byte limits, as that's what needs to be transferred or stored anyway. Limiting display size is better done by the display side software - if you have 100 pixels for the message, how many characters you fit depends on the font, etc., which isn't known by the data layer software anyway. Finally, given the complexity of the Unicode standard, you're probably going to have bugs at the edge cases anyway if you try anything else.

So it is a hard question with not a lot of general purpose use. Number of code units is trivial to calculate - it is just the length of the underlying data array - and the most meaningful/useful as a general rule, with a simple definition.

That's why b has length 4 beyond the surface explanation of "because the documentation says so".

Adam D. Ruppe
  • Essentially '.Length' isn't what most coders think it is. Maybe there should be a set of more specific properties (e.g. GlyphCount) and Length marked as Obsolete! – redcalx Nov 19 '14 at 12:53
  • @locster I agree, but don't think `Length` should be obsolete, to maintain the analogy with arrays. – This company is turning evil. Nov 19 '14 at 13:32
  • @locster It shouldn't be obsolete. The python one makes a lot of sense and nobody questions it. – simonzack Nov 19 '14 at 13:48
  • I think .Length makes a lot of sense and is a natural property, as long as you understand what it is and why it is that way. Then it works like any other array (in some languages like D, a string literally is an array as far as the language is concerned and it works really well) – Adam D. Ruppe Nov 19 '14 at 15:05
  • However, the discussion is specifically about C#, where strings are made up of Unicode chars using Windows' standard internal two-byte encoding, and for which the concept of string length is somewhat fuzzy in some corner cases. – redcalx Nov 19 '14 at 15:55
  • String length is always a fuzzy concept with Unicode - even with UTF-32, where you don't have to think about surrogate pairs, there's still combining characters, etc., that complicate matters. – Adam D. Ruppe Nov 19 '14 at 17:06
  • All C# strings are encoded as UTF-16 LE. However, they are not necessarily normalized in any particular way. – Jodrell Nov 20 '14 at 08:58
  • That's not true (a common misconception) - with UTF-32, lengthInBytes / 4 would give the number of *code points*, but that is *not* the same as the number of "characters" or graphemes. Consider LATIN SMALL LETTER E followed by a COMBINING DIAERESIS... that prints as a single character, it can even be normalized to a single codepoint, but it is still two units long, even in UTF-32. – Adam D. Ruppe Nov 20 '14 at 13:46
  • @AdamD.Ruppe, Agreed, I've clarified my understanding since my previous comment (which is now deleted.) – Jodrell Nov 21 '14 at 10:21
  • Just a little addendum: a letter can have more than one accent (or, generally speaking, combining character), like in ọ̵̌ or ɘ̧̊̄. It should be clear that you can't have predefined Unicode codepoints for all possible combinations. – Holger Nov 21 '14 at 11:18
  • Adam, this is really a good answer; there are some other issues that you didn't mention, like letters that melt together so that two characters form one grapheme/glyph. – Erdinc Ay Dec 02 '14 at 09:30

From the documentation of the String.Length property:

The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.

Cristian Ciupitu
nanny
  • Java behaves in the same way (also printing 4 for `String b`), as it uses the UTF-16 representation in char arrays. It's a 4 byte character in UTF-8. – Michael Nov 17 '14 at 17:17

Your character at index 1 in "A𠈓C" is half of a SurrogatePair

The key point to remember is that a surrogate pair uses two 16-bit code units to represent a single 32-bit character.

You can try this code and it will return True

Console.WriteLine(char.IsSurrogatePair("A𠈓C", 1));

Char.IsSurrogatePair Method (String, Int32)

true if the s parameter includes adjacent characters at positions index and index + 1, and the numeric value of the character at position index ranges from U+D800 through U+DBFF, and the numeric value of the character at position index+1 ranges from U+DC00 through U+DFFF; otherwise, false.
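
Java (whose strings are likewise UTF-16, as a comment on another answer notes) exposes the same check through `Character.isSurrogatePair`, which takes the two chars directly rather than a string and an index; a sketch for comparison:

```java
public class SurrogateCheck {
    public static void main(String[] args) {
        String b = "A𠈓C";

        // Index 1 holds the high (lead) surrogate, U+D840
        System.out.println(Character.isHighSurrogate(b.charAt(1)));              // true

        // Together with the low surrogate at index 2 it encodes one character
        System.out.println(Character.isSurrogatePair(b.charAt(1), b.charAt(2))); // true
    }
}
```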

This is further explained in String.Length property:

The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.

Habib

As the other answers have pointed out, even though there are 3 visible characters, they are represented by 4 Char objects. That is why the Length is 4 and not 3.

MSDN states that

The Length property returns the number of Char objects in this instance, not the number of Unicode characters.

However if what you really want to know is the number of "text elements" and not the number of Char objects you can use the StringInfo class.

var si = new StringInfo("A𠈓C");
Console.WriteLine(si.LengthInTextElements); // 3

You can also enumerate each text element like this

var enumerator = StringInfo.GetTextElementEnumerator("A𠈓C");
while (enumerator.MoveNext())
{
    Console.WriteLine(enumerator.Current);
}

Using foreach on the string will split the middle "letter" into two char objects, and the printed result won't correspond to the string.
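
The same pitfall exists in Java, which also iterates char by char unless you ask for code points; a comparison sketch (Java's `codePoints()` stream keeps the pair together, much like enumerating text elements does above):

```java
public class Iterate {
    public static void main(String[] args) {
        String b = "A𠈓C";

        // Char-by-char iteration splits 𠈓 into its two surrogate halves
        for (char c : b.toCharArray()) {
            System.out.println((int) c);      // 65, 55360, 56851, 67
        }

        // Code-point iteration keeps it whole
        b.codePoints().forEach(System.out::println); // 65, 131603, 67
    }
}
```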

dee-see

That is because the Length property returns the number of Char objects, not the number of Unicode characters. In your case, one of the Unicode characters is represented by more than one Char object (a surrogate pair).

The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.

Yuval Itzchakov

As others said, it's not the number of characters in the string but the number of Char objects. The character 𠈓 is code point U+20213. Since the value is outside the 16-bit char type's range, it's encoded in UTF-16 as the surrogate pair D840 DE13.
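
The arithmetic behind that pair can be sketched (shown here in Java; the same bit operations apply in any language): subtract 0x10000, then split the remaining 20 bits across the high and low surrogates.

```java
public class SurrogateMath {
    public static void main(String[] args) {
        int cp = 0x20213;                 // code point of 𠈓
        int v  = cp - 0x10000;            // 20-bit payload: 0x10213
        int high = 0xD800 + (v >>> 10);   // top 10 bits    -> 0xD840
        int low  = 0xDC00 + (v & 0x3FF);  // bottom 10 bits -> 0xDE13
        System.out.printf("%X %X%n", high, low); // D840 DE13
    }
}
```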

The way to get the length in characters was mentioned in the other answers. However, it should be used with care, as there can be many ways to represent a character in Unicode: "à" may be 1 composed character or 2 characters (a + combining diacritic). Normalization may be needed, as in the case of Twitter.

You should read this
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

phuclv

This is because Length counts 16-bit code units, and a single code unit can only represent Unicode code points no larger than U+FFFF. This set of code points is known as the Basic Multilingual Plane (BMP) and needs only 2 bytes per code point.

Unicode code points outside of the BMP are represented in UTF-16 using 4 byte surrogate pairs.

To correctly count the number of characters (3), use StringInfo

StringInfo b = new StringInfo("A𠈓C");
Console.WriteLine("Length b = {0}", b.LengthInTextElements);
Pier-Alexandre Bouchard

Okay, in .NET and C# all strings are encoded as UTF-16LE. A string is stored as a sequence of chars. Each char encapsulates the storage of 2 bytes, or 16 bits.

What we see "on paper or screen" as a single letter, character, glyph, symbol, or punctuation mark can be thought of as a single Text Element. As described in Unicode Standard Annex #29, UNICODE TEXT SEGMENTATION, each Text Element is represented by one or more Code Points. An exhaustive list of code points can be found in the Unicode code charts.

Each Code Point needs to encoded into binary for internal representation by a computer. As stated, each char stores 2 bytes. Code Points at or below U+FFFF can be stored in a single char. Code Points above U+FFFF are stored as a surrogate pair, using two chars to represent a single Code Point.

Given what we now know, we can deduce that a Text Element can be stored as one char, as a Surrogate Pair of two chars or, if the Text Element is represented by multiple Code Points, as some combination of single chars and Surrogate Pairs. As if that weren't complicated enough, some Text Elements can be represented by different combinations of Code Points, as described in Unicode Standard Annex #15, UNICODE NORMALIZATION FORMS.


Interlude

So, strings that look the same when rendered can actually be made up of a different combination of chars. An ordinal (byte by byte) comparison of two such strings would detect a difference; this may be unexpected or undesirable.

You can re-encode .NET strings so that they use the same Normalization Form. Once normalized, two strings with the same Text Elements will be encoded the same way. To do this, use the string.Normalize function. However, remember, some different Text Elements look similar to each other. :-s
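
The same idea can be sketched in Java, whose `java.text.Normalizer` is the counterpart of string.Normalize (a comparison sketch, since the C# API differs in names but not in effect):

```java
import java.text.Normalizer;

public class NormalizeDemo {
    public static void main(String[] args) {
        String composed   = "\u00E0";  // 'à' as one precomposed code point
        String decomposed = "a\u0300"; // 'a' plus a combining grave accent

        // Ordinal comparison sees two different char sequences
        System.out.println(composed.equals(decomposed)); // false

        // After normalizing both to the same form, they compare equal
        String n1 = Normalizer.normalize(composed,   Normalizer.Form.NFC);
        String n2 = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(n1.equals(n2)); // true
    }
}
```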


So, what does this all mean in relation to the question? The Text Element '𠈓' is represented by the single Code Point U+20213, CJK UNIFIED IDEOGRAPHS EXTENSION B. This means it cannot be encoded as a single char and must be encoded as a Surrogate Pair, using two chars. This is why string b is one char longer than string a.

If you need to reliably (see caveat) count the number of Text Elements in a string you should use the System.Globalization.StringInfo class like this.

using System.Globalization;

string a = "abc";
string b = "A𠈓C";

Console.WriteLine("Length a = {0}", new StringInfo(a).LengthInTextElements);
Console.WriteLine("Length b = {0}", new StringInfo(b).LengthInTextElements);

giving the output,

"Length a = 3"
"Length b = 3"

as expected.


Caveat

The .Net implementation of Unicode Text Segmentation in the StringInfo and TextElementEnumerator classes should be generally useful and, in most cases, will yield a response that the caller expects. However, as stated in Unicode Standard Annex #29, "The goal of matching user perceptions cannot always be met exactly because the text alone does not always contain enough information to unambiguously decide boundaries."

Jodrell
  • I think your answer is potentially confusing. In this case, 𠈓 is only a single code point, but since its code point exceeds 0xFFFF, it must be represented as 2 code units by using a surrogate pair. Grapheme is another concept built on top of code point, where a grapheme can be represented by a single code point or multiple code points, as seen in Korean's Hangul or many Latin-based languages. – nhahtdh Nov 21 '14 at 07:16
  • @nhahtdh, I agree, my answer was erroneous. I've rewritten it and hopefully it now creates greater clarity. – Jodrell Nov 21 '14 at 10:12