UTF-8: different behavior between input text and hardcoded text

Question

I have two versions of a console program that handles accented text, both entered by the user and hardcoded in the program, and their behavior seems strange to me. The following code shows the issue:

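Version 1:

#include <windows.h>  // SetConsoleOutputCP, SetConsoleCP, CP_UTF8
#include <iostream>
#include <string>
using namespace std;

int main(){
    // Switch both the output and the input console codepage to UTF-8.
    SetConsoleOutputCP(CP_UTF8);
    SetConsoleCP(CP_UTF8);

    string str1 = "héhé";   // hardcoded text
    cout << str1 << endl;

    string str2;            // text typed by the user
    cin >> str2;
    cout << str2 << endl;
}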

Output:

héhé //str1
héhé //input str2
h h  //str2

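Version 2:

#include <iostream>
#include <string>
using namespace std;

int main(){
    // Console codepages are left at their defaults here.
    string str1 = "héhé";   // hardcoded text
    cout << str1 << endl;

    string str2;            // text typed by the user
    cin >> str2;
    cout << str2 << endl;
}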

Output:

h├®h├® //str1
héhé //input str2
héhé //str2

I would have expected the text to be displayed properly with the UTF-8 codepage.

Is there a way to display both strings correctly? Preferably without using wstring, since the entire program is already fairly complex.

Thanks a lot!

The console does not really support using UTF-8 as the input codepage. It translates all non-ASCII characters to null bytes. As long as that's the case, `SetConsoleCP(CP_UTF8)` should fail instead of pretending that UTF-8 is supported. You have to use the wide-character UTF-16 API to read Unicode from the console. — Eryk Sun, Jun 13 '20 at 17:35
Using UTF-8 for the output codepage is generally okay in Windows 8.1+ -- except if a buffered sequence gets split across flushes. It doesn't work right prior to Windows 8.1 because `WriteConsoleA` and `WriteFile` in older versions return the number of decoded UTF-16 16-bit words written to the console instead of the number of UTF-8 bytes written, which causes buffered writers to output a trailing sequence of garbage characters when text is written that contains non-ASCII characters. — Eryk Sun, Jun 13 '20 at 17:45
@ErykSun So it is not normal that `SetConsoleCP(CP_UTF8)` returns true? And if the input is the problem, why does version 2 work fine on the input string? — Raphael, Jun 13 '20 at 17:58
In the second example, the current input and output codepages map the character "é". For example, in codepage 850 it's `"\x82"`. In UTF-8, "é" is encoded as `"\xC3\xA9"`. Codepage 850 maps the latter to the Unicode string "├®", i.e. `"\u251C\u00AE"`, which is what gets stored in the console screen buffer and displayed in the window. — Eryk Sun, Jun 13 '20 at 18:52
Bear in mind that within the console host process (conhost.exe), the input and output buffers are stored as wide-character (16-bit word) strings. Translation to and from byte (8-bit) strings using the input and output codepages is only implemented to support the bytes API functions `ReadFile`, `WriteFile`, `ReadConsoleA`, `WriteConsoleA` -- as opposed to wide-character `ReadConsoleW` and `WriteConsoleW`. — Eryk Sun, Jun 13 '20 at 18:53
`SetConsoleCP(CP_UTF8)` returns true, but the API is lying to you. The console does not support setting the input codepage to UTF-8. They've had over 20 years to fix this and refuse to care about it. — Eryk Sun, Jun 13 '20 at 19:00
