How to traverse a wstring properly?

Question

I was trying to traverse a std::wstring, here's my code:

#include <iostream>
#include <string>

int main() {
    std::wstring ws;
    std::getline(std::wcin, ws);
    for (auto wc : ws) {
        std::wcout << wc << std::endl;
    }
}

When I ran this program and typed “你好” into the console, it just printed 4 blank lines.

What I expect the program to output:

你
好

I searched this site but found no solution.

What should I do to produce the result I expect?

`wstring` is not the same everywhere. Which OS and compiler did you use? — Mark Ransom, Oct 08 '22 at 03:49
You should add the details of the compiler and platform to the question itself, not as a comment. — Adrian Mole, Oct 08 '22 at 13:39
See the duplicate. On Windows the console mode needs to be changed to support entering and printing via wcin/wcout. — Mark Tolonen, Oct 08 '22 at 16:36

score -1 · Answer 1 · edited Oct 19 '22 at 04:13

First, this is an encoding problem, so it has little to do with `wstring` itself; a `string` would probably have the same problem. The size of `wchar_t` and the encoding are system-dependent, so your code would probably work under Linux.

The explanation for your result is that under Windows a `wchar_t` is 2 bytes and `wstring` uses UTF-16. UTF-16 is a variable-length encoding, and I am fairly sure that your (Chinese?) characters cannot be represented in 2 bytes and need more space.

So for your exact example you could use a function or wrapper class that gives you full code points instead of code units. I personally do not know of any library that does so, because I follow my own advice:

I recommend reading http://utf8everywhere.org/ , especially the part about code points, code units, abstract characters and so on, and then sticking to UTF-8 and the opaque data argument.
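
To make the code unit versus code point distinction concrete, here is a minimal sketch of a function that yields full code points from UTF-16 code units. It takes a `std::u16string` so that it behaves the same on every platform (on Windows a `std::wstring` holds the same 16-bit units). The name `to_code_points` is hypothetical, and replacing unpaired surrogates with U+FFFD is a choice made for this sketch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decodes UTF-16 code units into Unicode code points.
// A high surrogate followed by a low surrogate becomes one code point;
// an unpaired surrogate is replaced with U+FFFD.
std::vector<char32_t> to_code_points(const std::u16string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char32_t unit = s[i];
        if (unit >= 0xD800 && unit <= 0xDBFF) {
            // High surrogate: valid only if a low surrogate follows.
            if (i + 1 < s.size() && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
                char32_t low = s[i + 1];
                out.push_back(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00));
                ++i;  // the low surrogate has been consumed
            } else {
                out.push_back(0xFFFD);
            }
        } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
            // Low surrogate without a preceding high surrogate.
            out.push_back(0xFFFD);
        } else {
            // Code unit in the Basic Multilingual Plane: already a code point.
            out.push_back(unit);
        }
    }
    return out;
}
```

For `u"\u4F60\u597D"` (你好) this returns two code points, because both characters lie in the Basic Multilingual Plane and take one code unit each. Only characters above U+FFFF, such as U+1F600, are stored as a surrogate pair.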

answered Oct 08 '22 at 08:55 by gerum · edited Oct 19 '22 at 04:13 by marc_s

Thank you. But a Chinese character does take up 2 bytes. – F1yingC0der Oct 08 '22 at 11:41
@F1yingC0der that is incorrect. Each of the characters in your question takes 2 `wchar_t` units under Windows, and each unit is 2 bytes. – Mark Ransom Oct 08 '22 at 14:35
@MarkRansom U+4F60 (你) and U+597D (好) each use a single code unit in UTF-16. – Mark Tolonen Oct 08 '22 at 16:25
@MarkTolonen thanks, my mistake. I used Python to quickly get the UTF-16 equivalents, and it snuck a BOM on the front of each! Next time I'll try to remember to use `'utf-16le'`. – Mark Ransom Oct 08 '22 at 17:32
