
For example:

s = "兰蔻面膜"

In Python, its length is:

>>> len("兰蔻面膜")
4

But in C++, the length is 12, as shown below:

cout << s.length() << endl;
12

Why is that? I am simply checking the length of a Chinese string in a C++ IDE and found that its length is 12, even though `s` has 4 characters.

marlon
  • Actually in c++, its length is 12. – marlon Nov 20 '21 at 05:15
  • 6
    I imagine number of symbols vs number of bytes required to encode the symbols – mozway Nov 20 '21 at 05:15
  • We have no idea what `s` is... – Evg Nov 20 '21 at 05:18
  • It should be `std::wstring s = L"兰蔻面膜";`, whose length is `4`. – 康桓瑋 Nov 20 '21 at 05:21
  • @康桓瑋 It works in this case but not in all cases, because not all Unicode characters are represented by 1 `wchar_t` – phuclv Nov 20 '21 at 05:23
  • 2
    `len("兰蔻面膜".encode("utf8"))` is 12. – jtbandes Nov 20 '21 at 05:24
  • @phuclv It should, and it does work with GNU C, which has `sizeof(wchar_t) == 4`. Unfortunately it doesn't work on MSVC, but I guess the OP is not using MSVC, because Windows usually doesn't have UTF-8 as its default encoding, right? – yyyy Nov 20 '21 at 05:55
  • 1
    @yyyy try something like `L"‍"`, `L"‍‍‍"`, `L"‍❤️‍‍"`, `L""`, `L"Å"`, `L"각"`, `L"நி"`, `L"षि"`, `L""`, `L"❤️"`, `L"é"`... to see if any of them has length = 1 even when `sizeof(wchar_t) == 4`. UTF-32 doesn't mean fixed-length characters – phuclv Nov 20 '21 at 07:49
  • @phuclv `L"Å"` could have length 1 (if codepoint U+00C5 has been used)? (I hexdumped your sample and see you used an A and a ring separately. How underhanded...) ;-) – Scheff's Cat Nov 20 '21 at 08:08
  • 1
    @Scheff'sCat that's possible after [normalization](https://www.unicode.org/reports/tr15/), but only a few characters have such equivalence, mostly Latin letters. Lots of characters are composed of multiple code points and you can't count the code units to get the string length – phuclv Nov 20 '21 at 08:54
