Why are Chinese string lengths different in C++ and Python?

Asked Nov 20 '21 at 05:12

Active Nov 20 '21 at 05:20

Viewed 166 times

For example:

s = "兰蔻面膜"

In Python, its length is:

>>> len("兰蔻面膜")
4

But in C++, the length is 12, as shown below:

cout << s.length() << endl;
12

Why is that? I am simply checking the length of a Chinese string in a C++ IDE and found that it is 12, even though `s` has 4 characters.

edited Nov 20 '21 at 05:20

asked Nov 20 '21 at 05:12

marlon


Actually, in C++ its length is 12. – marlon Nov 20 '21 at 05:15
I imagine it's the number of symbols vs. the number of bytes required to encode those symbols – mozway Nov 20 '21 at 05:15
We have no idea what `s` is... – Evg Nov 20 '21 at 05:18
It should be `std::wstring s = L"兰蔻面膜";`, whose length is `4`. – 康桓瑋 Nov 20 '21 at 05:21
@康桓瑋 That works in this case but not in all cases, because not every Unicode character is represented by a single `wchar_t` – phuclv Nov 20 '21 at 05:23
`len("兰蔻面膜".encode("utf8"))` is 12. – jtbandes Nov 20 '21 at 05:24
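To make the two numbers from the question concrete, here is a minimal Python sketch. It assumes the C++ source and `std::string` held UTF-8 data, which the question does not state but which the length of 12 suggests. The variable names are only illustrative.

```python
s = "兰蔻面膜"

# len() on a Python str counts Unicode code points.
code_points = len(s)

# Encoding to UTF-8 gives the byte count, which is what a byte-oriented
# length (such as std::string::length() on UTF-8 data) would report.
utf8_bytes = len(s.encode("utf-8"))

# Each of these four characters takes three bytes in UTF-8.
bytes_per_char = [len(ch.encode("utf-8")) for ch in s]

print(code_points, utf8_bytes, bytes_per_char)
```

So 4 and 12 are both correct answers, just to different questions: one counts code points, the other counts bytes.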
@phuclv It should work, and it does with GNU C, which has `sizeof(wchar_t) == 4`. Unfortunately it doesn't work on MSVC, but I guess OP is not using MSVC, because Windows usually doesn't have UTF-8 as its default encoding, right? – yyyy Nov 20 '21 at 05:55
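The effect of the `wchar_t` size can be approximated in Python by counting UTF-16 and UTF-32 code units. This is only a sketch: it assumes a 2-byte `wchar_t` holds UTF-16 and a 4-byte `wchar_t` holds UTF-32, and `code_units` is a hypothetical helper, not a library function. The character U+20000 is picked here just as an example of a CJK ideograph outside the Basic Multilingual Plane.

```python
def code_units(text, encoding, unit_size):
    """Count fixed-size code units after encoding (hypothetical helper)."""
    return len(text.encode(encoding)) // unit_size

bmp_text = "兰蔻面膜"          # all four characters are in the BMP
astral_text = "\U00020000"     # one CJK ideograph outside the BMP

# The "-le" codecs do not prepend a byte order mark.
utf16_bmp = code_units(bmp_text, "utf-16-le", 2)
utf32_bmp = code_units(bmp_text, "utf-32-le", 4)
utf16_astral = code_units(astral_text, "utf-16-le", 2)
utf32_astral = code_units(astral_text, "utf-32-le", 4)

print(utf16_bmp, utf32_bmp, utf16_astral, utf32_astral)
```

For the string in the question both encodings give 4, but the single character U+20000 needs a surrogate pair (two units) in UTF-16.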
@yyyy try something like `L"‍"`, `L"‍‍‍"`, `L"‍❤️‍‍"`, `L""`, `L"Å"`, `L"각"`, `L"நி"`, `L"षि"`, `L""`, `L"❤️"`, `L"é"`... to see if any of them has length = 1 even when `sizeof(wchar_t) == 4`. UTF-32 doesn't mean fixed-length characters – phuclv Nov 20 '21 at 07:49
@phuclv `L"Å"` could have length 1 (if codepoint U+00C5 has been used)? (I hexdumped your sample and see you used an A and a ring separately. How underhanded...) ;-) – Scheff's Cat Nov 20 '21 at 08:08
@Scheff'sCat that's possible after [normalization](https://www.unicode.org/reports/tr15/), but only a few characters have such equivalence, mostly Latin letters. Lots of characters are composed of multiple code points and you can't count the code units to get the string length – phuclv Nov 20 '21 at 08:54
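The normalization point can be checked with Python's standard `unicodedata` module. The sketch below uses two of the samples mentioned in the comments, written with escapes so the code points are unambiguous. The choice of NFC as the normalization form is an assumption made for illustration.

```python
import unicodedata

# "A" followed by U+030A COMBINING RING ABOVE: two code points, one visible character.
decomposed = "A\u030A"
composed = unicodedata.normalize("NFC", decomposed)

# Devanagari SSA (U+0937) plus VOWEL SIGN I (U+093F): no precomposed form exists,
# so NFC leaves it as two code points.
devanagari = "\u0937\u093F"
devanagari_nfc = unicodedata.normalize("NFC", devanagari)

print(len(decomposed), len(composed), len(devanagari_nfc))
```

After NFC the first sample collapses to the single code point U+00C5, while the Devanagari sample still has length 2, so even a code point count is not the same as the number of characters a reader sees.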
