I have a string like this:

string a1 = "{`name`:`санкт_петербург`,`shortName`:`питер`,`hideByDefault`:false}";

a1.Length reports a string length of 68, which is not what I expect: Cyrillic symbols are twice as big (because of UTF-16 encoding, I presume), so the real length of this string should be 87.

I need to either get the number of Cyrillic symbols in the string or get the real string length in some other way.

Mr Scapegrace
  • *All* .NET strings are UTF16. *All* characters are 2 bytes long. `.Length` returns the number of characters, not bytes. The string has 68 characters and takes 136 bytes – Panagiotis Kanavos Mar 15 '17 at 08:14
  • What about `Encoding.GetByteCount`? – Patrick Hofman Mar 15 '17 at 08:16
  • Even with "ASCII" characters, the length and size remain the same. For `name`, the length is 4 and uses 8 bytes. – Panagiotis Kanavos Mar 15 '17 at 08:21
  • @PanagiotisKanavos Actually, all characters in UTF16 aren't 2 bytes long. Some are longer. – Matthew Watson Mar 15 '17 at 08:47
  • @MatthewWatson typically emojis and Chinese. In the most common case encountered outside Asia, it's 2 bytes - Unless you are Tacoma Airport, where all announcements are also in Chinese – Panagiotis Kanavos Mar 15 '17 at 09:00
  • @PanagiotisKanavos Aye, and the point is you can't just multiply the string's length by 2 to get the number of bytes - you MUST use the `GetByteCount()` method (or convert to byte array and check its length, but of course that would be horribly inefficient if all you want is the byte length). – Matthew Watson Mar 15 '17 at 09:38

1 Answer

From the MSDN:

The .NET Framework uses the UTF-16 encoding (represented by the UnicodeEncoding class) to represent characters and strings.

So a1.Length is measured in UTF-16 code units (see "What's the difference between a character, a code point, a glyph and a grapheme?"). Cyrillic characters, being in the BMP (Basic Multilingual Plane), each use a single code unit (so a single char). Many emoji, for example, use two code units (two chars, 4 bytes) because they are outside the BMP. See for example https://ideone.com/ASDORp.
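To make the code unit versus code point distinction concrete, here is a minimal sketch. The class name `CodeUnitDemo` and the helper `CountCodePoints` are made up for illustration and are not part of .NET; only `char.IsHighSurrogate` and `char.IsLowSurrogate` are framework calls.

```csharp
using System;

public static class CodeUnitDemo
{
    // Hypothetical helper: counts Unicode code points by treating each
    // valid surrogate pair (two chars) as a single code point.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsHighSurrogate(s[i]) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
            {
                i++; // skip the low surrogate, it belongs to the same code point
            }
            count++;
        }
        return count;
    }

    public static void Main()
    {
        string cyrillic = "питер";      // all characters are in the BMP
        string emoji = "\U0001F600";     // U+1F600, outside the BMP

        Console.WriteLine(cyrillic.Length);           // 5
        Console.WriteLine(CountCodePoints(cyrillic)); // 5
        Console.WriteLine(emoji.Length);              // 2
        Console.WriteLine(CountCodePoints(emoji));    // 1
    }
}
```

For Cyrillic text the two counts agree, which is why `Length` is already the number of characters in the question's string.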

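If you want the size in bytes of the in-memory UTF-16 representation, a1.Length * 2 is the answer, since every code unit takes exactly 2 bytes. If you want to know how many bytes the string would take in UTF-8 (a very common encoding, not used internally by .NET, but widely used on the web, in XML, ...), use Encoding.UTF8.GetByteCount(a1). For the string in the question that gives 87: 68 code units plus one extra byte for each of the 19 Cyrillic letters.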
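As a rough sketch of the options above applied to the string from the question: the class name `ByteCountDemo` and the helper `CountCyrillic` are hypothetical, and the helper assumes that "Cyrillic symbol" means a char in the basic Cyrillic block U+0400..U+04FF.

```csharp
using System;
using System.Linq;
using System.Text;

public static class ByteCountDemo
{
    // Hypothetical helper: counts chars in the basic Cyrillic block
    // (U+0400..U+04FF). Extended Cyrillic blocks are not covered.
    public static int CountCyrillic(string s)
    {
        return s.Count(c => c >= '\u0400' && c <= '\u04FF');
    }

    public static void Main()
    {
        string a1 = "{`name`:`санкт_петербург`,`shortName`:`питер`,`hideByDefault`:false}";

        Console.WriteLine(a1.Length);                         // 68 UTF-16 code units
        Console.WriteLine(Encoding.Unicode.GetByteCount(a1)); // 136 bytes as UTF-16
        Console.WriteLine(Encoding.UTF8.GetByteCount(a1));    // 87 bytes as UTF-8
        Console.WriteLine(CountCyrillic(a1));                 // 19 Cyrillic letters
    }
}
```

The 87 the question expects is therefore the UTF-8 byte count, not a character count.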

xanatos