1

I have a string that consists of a mixture of Chinese characters and printable ASCII characters.

string str = "Test測試123";

When I use str.Length or str.ToCharArray(), each Chinese character is counted as 1 character! That is not what I need, because each Chinese character takes 2 bytes!

Even when I try Encoding.ASCII.GetBytes(str), it just gives me 63 for every Chinese character! And the resulting length is the same as with Length or ToCharArray()!

That is the wrong result for my purpose!

Is there any way to get the actual length of the string?

For the example I just gave: 11 instead of 9?

Salah Akbari
Pikachu620
  • 4
    *composed of 2 ASCII codes* No... If converted to UTF-8, each Chinese character takes three bytes, which is something very different. Try `Encoding.UTF8.GetByteCount(str)` – xanatos May 24 '18 at 10:26
  • That's what I mean! Updated! – Pikachu620 May 24 '18 at 10:28
  • @bommelding: I'm sorry that I don't really understand what you mean! – Pikachu620 May 24 '18 at 10:31
  • Strings in .NET are UTF16 by default. For this reason, the size of the string (in bytes) will be different than its length (number of characters). Which one do you need? – Martin May 24 '18 at 10:32
  • 1
    I tried `Encoding.UTF8.GetByteCount(str)`, but it gives me a size **bigger** than what it actually is! For the example I gave: 13 instead of 11! Where do the extra 2 come from? – Pikachu620 May 24 '18 at 10:35
  • @Martin I would like to get, how should I put it, the **size of the byte array used to store the string**. For the example I gave: each Chinese character uses 2 bytes, and every other character uses 1 byte, so it adds up to 11 in total. Can that be done? Thanks! – Pikachu620 May 24 '18 at 10:39

3 Answers

7

Length in the Unicode world is always fun... What Length do you need? For example:

string str = "\U0001F600"; // one code point outside the BMP

// Length in UTF-16 code units
int len = str.Length; // 2

// Length in bytes, if encoded in UTF16, as done by .NET
int len2 = str.Length * 2; // 4

// Length in bytes, if encoded in UTF8
int len3 = Encoding.UTF8.GetByteCount(str); // 4

// Length in unicode code points
int len4 = Encoding.UTF32.GetByteCount(str) / 4; // 1

Note that there is a fifth length: the length in grapheme clusters, which is even more complex to calculate because some code points can "merge" together, and a sixth: the length in glyphs.
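
As a rough illustration of that fifth length, here is a minimal sketch using `System.Globalization.StringInfo`, which counts "text elements". The class and method names (`GraphemeDemo`, `TextElements`) are made up for this example, and the exact grouping rules for complex sequences such as emoji depend on the .NET version, so treat the result as an approximation of grapheme clusters.

```csharp
using System.Globalization;

public static class GraphemeDemo
{
    // Counts text elements: a base character followed by combining marks
    // is reported as one element, even though it is several UTF-16 code units.
    public static int TextElements(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }
}
```

For instance, `"e\u0301"` (an "e" followed by a combining acute accent) has a `Length` of 2 but counts as a single text element.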

Now, your string has len equal to 9, len2 equal to 18, len3 (so the length in bytes if converted to UTF8) equal to 13, len4 equal to 9.
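
To check those numbers yourself, the four measurements can be wrapped in small helpers. This is only a sketch; the class and method names (`StringLengths`, `Utf16Units`, and so on) are invented for illustration, while the `Encoding` calls are the same ones used above.

```csharp
using System.Text;

public static class StringLengths
{
    // Number of UTF-16 code units (what string.Length reports).
    public static int Utf16Units(string s) { return s.Length; }

    // Bytes needed when the string is encoded as UTF-16 (no byte order mark).
    public static int Utf16Bytes(string s) { return Encoding.Unicode.GetByteCount(s); }

    // Bytes needed when the string is encoded as UTF-8.
    public static int Utf8Bytes(string s) { return Encoding.UTF8.GetByteCount(s); }

    // Number of Unicode code points: UTF-32 uses exactly 4 bytes per code point.
    public static int CodePoints(string s) { return Encoding.UTF32.GetByteCount(s) / 4; }
}
```

For `"Test測試123"` these return 9, 18, 13 and 9 respectively: the 7 ASCII characters take 1 byte each in UTF-8 and the 2 Chinese characters take 3 bytes each, which is where 13 comes from.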

Nearly all Chinese characters are in the Basic Multilingual Plane of the Unicode standard, so each has a length of 1 UTF-16 code unit and encodes to 3 bytes in UTF-8.

An interesting reference: What's the difference between a character, a code point, a glyph and a grapheme?

Ah... and please forget about Encoding.ASCII. Live as if it doesn't exist. It probably isn't what you think it is. Even if you lived in the old MS-DOS world with its funny characters, that wasn't ASCII.
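
To see why the question got 63s everywhere: with its default settings, `Encoding.ASCII` replaces every character it cannot represent with `?`, whose code is 63. A small sketch (the `AsciiDemo` class and its method names are hypothetical):

```csharp
using System.Text;

public static class AsciiDemo
{
    // Encodes with the default ASCII encoding; anything outside 0..127
    // is replaced by '?' (byte value 63), so the original text is lost.
    public static byte[] ToAscii(string s)
    {
        return Encoding.ASCII.GetBytes(s);
    }

    // Round trip through ASCII to make the data loss visible.
    public static string RoundTrip(string s)
    {
        return Encoding.ASCII.GetString(ToAscii(s));
    }
}
```

`AsciiDemo.RoundTrip("Test測試123")` gives back `"Test??123"`: still 9 characters, but the two Chinese characters are gone.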

xanatos
  • 2
    Thanks for your answer, it gave me the idea! It's all about the encoding! In my case, it's BIG5! So if I use **`Encoding.GetEncoding("BIG5").GetByteCount(str)`**, it gives me the answer I'm looking for! Thank you very **VERY** much!!! – Pikachu620 May 24 '18 at 10:56
  • 1
    "Live like it doesn't exist.": Yes! ASCII is required by certain standards. If such a standard is not referenced, it's almost certainly not ASCII. – Tom Blodget May 27 '18 at 16:25
  • @TomBlodget as you wrote :-) – xanatos May 27 '18 at 18:16
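
A minimal sketch of the Big5 approach from the comment above. The `Big5Demo` class is hypothetical. On .NET Core and .NET 5+ the code-page encodings have to be registered first; on .NET Framework, `Encoding.GetEncoding("big5")` works on its own and `CodePagesEncodingProvider` is only available through the System.Text.Encoding.CodePages package, so the registration line is an assumption about your target runtime. Characters that Big5 cannot represent are replaced during encoding by default, so the count is only meaningful for text that really fits in Big5.

```csharp
using System.Text;

public static class Big5Demo
{
    public static int Big5ByteCount(string s)
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code pages
        // such as Big5 are not available until this provider is registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Big5 uses 1 byte for ASCII and 2 bytes for each Chinese character.
        return Encoding.GetEncoding("big5").GetByteCount(s);
    }
}
```

For `"Test測試123"` this returns 11: 7 single-byte ASCII characters plus 2 double-byte Chinese characters.
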
0

A Chinese character is displayed twice as wide as an English letter, but the length of its encoding is another story: in UTF-8 a Chinese character takes three bytes, while an English letter always takes one byte.

// Only for UTF-8, and only when the string contains just ASCII and 3-byte Chinese characters
string s = "計算字串的長度this is a test";
int sLength = s.Length; // length is 21
int byteCount = Encoding.UTF8.GetByteCount(s); // byte count is 35
int chineseCount = (byteCount - sLength) / 2; // Chinese count is 7
0

Based on @eldercharlie's answer, the display width (counting each Chinese character as 2 columns) is:

int len = text.Length;
int byteCount = Encoding.UTF8.GetByteCount(text);
int width = (len + byteCount) / 2;
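
Why the arithmetic works, as a sketch with made-up names (`WidthDemo`, `ChineseCount`, `DisplayWidth`): if the string has n ASCII characters and m Chinese characters, then Length is n + m and the UTF-8 byte count is n + 3m. The difference is 2m, and the sum is 2n + 4m, so halving them gives m and n + 2m. This assumes every character is either 1 byte or 3 bytes in UTF-8; anything else (accented Latin letters, emoji) breaks the formulas.

```csharp
using System.Text;

public static class WidthDemo
{
    // (bytes - length) / 2 = ((n + 3m) - (n + m)) / 2 = m
    public static int ChineseCount(string text)
    {
        return (Encoding.UTF8.GetByteCount(text) - text.Length) / 2;
    }

    // (length + bytes) / 2 = ((n + m) + (n + 3m)) / 2 = n + 2m
    public static int DisplayWidth(string text)
    {
        return (text.Length + Encoding.UTF8.GetByteCount(text)) / 2;
    }
}
```

For `"計算字串的長度this is a test"` this gives 7 Chinese characters and a width of 28; for `"Test測試123"` the width is 11.
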
CodingNinja