I am trying to iterate over a std::wstring. Here is my code:

#include <iostream>
#include <string>

int main() {
    std::wstring ws;
    std::getline(std::wcin, ws);
    for (auto wc : ws) {
        std::wcout << wc << std::endl;
    }
}

When I ran this program and typed “你好” into the console, it just printed 4 blank lines.

What I expect the program to output:

你
好

I have searched this site and found no solution.

What should I do to produce the result I expect?

1 Answer

First, this is an encoding problem, so it has little to do with wstring; a string would probably have the same problem. The size of wchar_t and the encoding are system-dependent, so your code would probably work under Linux.

The explanation for your result is that under Windows a wstring uses 2 bytes per code unit and UTF-16 encoding. UTF-16 is a variable-length encoding, though, and I am pretty sure that your (Chinese?) symbols cannot be represented in 2 bytes and need more space.

So for your exact example you could use some function or wrapper class that gives you full code points instead of code units, but I personally do not know of any library that does so, because I follow my own advice:

I recommend reading http://utf8everywhere.org/, especially the part about code points, code units, abstract characters and so on, and then sticking to UTF-8 and the opaque data argument.
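To illustrate the "full code points instead of code units" idea mentioned above, here is a minimal sketch of such a function. The name `decode_utf16` is hypothetical and not taken from any library. It assumes the input is a `std::u16string` holding UTF-16 code units (used here instead of `std::wstring` so the sketch does not depend on the platform's `wchar_t` size), and it passes unpaired surrogates through unchanged rather than reporting an error.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not a library function): turns UTF-16 code units
// into Unicode code points by combining surrogate pairs.
std::vector<char32_t> decode_utf16(const std::u16string& units) {
    std::vector<char32_t> code_points;
    for (std::size_t i = 0; i < units.size(); ++i) {
        char32_t cu = units[i];
        bool is_high_surrogate = cu >= 0xD800 && cu <= 0xDBFF;
        if (is_high_surrogate && i + 1 < units.size()) {
            char32_t next = units[i + 1];
            bool is_low_surrogate = next >= 0xDC00 && next <= 0xDFFF;
            if (is_low_surrogate) {
                // Combine the pair into one code point above U+FFFF.
                code_points.push_back(static_cast<char32_t>(
                    0x10000 + ((cu - 0xD800) << 10) + (next - 0xDC00)));
                ++i;  // the low surrogate has been consumed
                continue;
            }
        }
        // Single-unit (BMP) character, or an unpaired surrogate,
        // which this sketch passes through unchanged.
        code_points.push_back(cu);
    }
    return code_points;
}
```

With this sketch, `decode_utf16(u"\u4F60\u597D")` yields two code points from two code units, while a character outside the Basic Multilingual Plane such as `u"\U0001F600"` yields one code point from two code units.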

marc_s
gerum
  • Thank you. But a Chinese character does take up 2 bytes. – F1yingC0der Oct 08 '22 at 11:41
  • @F1yingC0der that is incorrect. Each of the characters in your question takes 2 `wchar_t` units under Windows, and each unit is 2 bytes. – Mark Ransom Oct 08 '22 at 14:35
  • @MarkRansom U+4F60 (你) and U+597D (好) each use a single code unit in UTF-16. – Mark Tolonen Oct 08 '22 at 16:25
  • @MarkTolonen thanks, my mistake. I used Python to quickly get the UTF-16 equivalents, and it snuck a BOM on the front of each! Next time I'll try to remember to use `'utf-16le'`. – Mark Ransom Oct 08 '22 at 17:32