
When getting input from std::cin on Windows, the input apparently always arrives in the Windows-1252 encoding (the default for the host machine in my case) despite all the configuration I have made, which apparently only affects the output. Is there a proper way to capture input on Windows in UTF-8 encoding?

For instance, let's check out this program:

#include <iostream>
#include <locale>
#include <string>

int main(int argc, char* argv[])
{
    std::cin.imbue(std::locale("es_ES.UTF-8"));
    std::cout.imbue(std::locale("es_ES.UTF-8"));

    std::cout << "ñeñeñe> ";
    std::string in; 
    std::getline( std::cin, in ); 
    std::cout << in; 

}

I've compiled it using Visual Studio 2022 on a Windows machine with a Spanish locale. The source code is saved in UTF-8. When executing the resulting program (in a Windows PowerShell session, after running chcp 65001 to set the console code page to UTF-8), I see the following:

PS C:\> .\test_program.exe
ñeñeñe> ñeñeñe
 e e e

The first "ñeñeñe> " is correct: it displays the "ñ" character correctly on the output console. So far, so good. The user input is also echoed back to the console correctly: another good point. But when it comes to writing the captured string back to the output, the "ñ" character is replaced by an empty space.

When debugging this program, I see that the variable "in" has captured the input in an encoding that is not UTF-8: the "ñ" occupies only one byte, whereas in UTF-8 that character must consume two. The conclusion is that the input is not affected by the chcp command. Am I doing something wrong?

UPDATE

Somebody asked me to show what happens when changing to wcout/wcin:

std::wcout << u"ñeñeñe> ";
std::wstring in;
std::getline(std::wcin, in);
std::wcout << in;

Behaviour:

PS C:\> .\test.exe
0,000,7FF,6D1,B76,E30ñeñeñe
 e e e

Another try (setting the string as L"ñeñeñe"):

ñeñeñe> ñeñeñe
 e e e

Leaving it as is:

std::wcout << "ñeñeñe> ";

Result is:

eee>
Raul Luna
  • Have you tried `wcin`? – Thomas Weller Mar 14 '22 at 09:54
  • Yes, I've tried, and got some.... let's say flowery results. If using wcin/wcout only, the input string is echoed back as rubbish, because it is encoded internally as (maybe?) UTF-16 and sent back as UTF-16 when the expected string should be in UTF-8. I can update the question accordingly if you want – Raul Luna Mar 14 '22 at 09:57
  • Yes, I think that's helpful to tell a) that you know about it and b) it does not solve the problem. – Thomas Weller Mar 14 '22 at 10:01
  • 1
    It seems it's really that bad: https://alfps.wordpress.com/2011/12/08/unicode-part-2-utf-8-stream-mode/#utf8_mode_input – Thomas Weller Mar 14 '22 at 10:13
  • Definitely that's the problem. – Raul Luna Mar 14 '22 at 10:27
  • I wonder if it's a workaround for this, like using a windows forms or something like that that emulates a console. The problem is I want to create a simple console application that accepts and returns UTF-8 and it seems an impossible thing – Raul Luna Mar 14 '22 at 10:28
  • It is not impossible but you might have to do it yourself with the CRT. – Anders Mar 14 '22 at 11:50
  • @Anders, I really doubt that nobody has ever bumped into this. How is it done in Java, for instance??? AFAIK, Java is compiled in C and internally its strings are UTF-8, so some kind of conversion is made behind the scenes – Raul Luna Mar 14 '22 at 13:12
  • @RaulLuna: it uses a different method. Consoles read the keys and translate them into characters to send to the program. Many programs go the hard way: read the scan code, interpret the keyboard layout, and find which character you get. – Giacomo Catenazzi Mar 14 '22 at 13:37
  • Microsoft recently implemented a pty API. It included a series of blog posts, which were interesting: they told us that many programs just put a console outside the visible desktop (with active keyboard) and grab images of that console, because there was no other way to have a reliable interface. Ugly. – Giacomo Catenazzi Mar 14 '22 at 13:41
  • I meant to say without the CRT. Meaning, `ReadFile` on `GetStdHandle`. You of course have to convert the input bytes yourself if they are not already UTF-8. – Anders Mar 14 '22 at 13:52
  • You can use `_setmode` on both stdout and stdin to read wide strings that you could convert to UTF-8. See if [this answer](https://stackoverflow.com/a/65816756/235698) helps you. – Mark Tolonen Mar 14 '22 at 15:42
  • Yes, @MarkTolonen, I've tried that way, but the results are not satisfactory (see my own answer). The funny part is that because the program is stored (or compiled, I don't know) in UTF-8, the string is converted in funny ways, resulting in rubbish most of the time – Raul Luna Mar 15 '22 at 15:27
  • The best solution I've found so far comes from the example program on this page: https://alfps.wordpress.com/2011/12/08/unicode-part-2-utf-8-stream-mode/#utf8_mode_input. It allows one to 1) enter the literal strings in Unicode, 2) input text in Unicode, and the output will also be in Unicode – Raul Luna Mar 15 '22 at 17:07
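Anders' suggestion above (read raw input yourself and "convert the input bytes yourself if they are not already UTF-8") hinges on a code-point-to-UTF-8 encoding step. As a minimal sketch of just that step (my own illustration, not code from any of the commenters; the console-reading part, e.g. ReadConsoleW on Windows, is omitted):

```cpp
#include <string>

// Encode a single Unicode code point (assumed valid, <= 0x10FFFF,
// not a surrogate) as a UTF-8 byte sequence.
std::string to_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {                       // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes, e.g. "ñ" (U+00F1)
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes, most of the BMP
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes, supplementary planes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

On Windows the wide input is UTF-16, so code points above U+FFFF first have to be reassembled from surrogate pairs before calling a function like this (or you let WideCharToMultiByte with CP_UTF8 do the whole job).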

1 Answer


This is the closest to the solution I've found so far:

#include <fcntl.h>
#include <io.h>
#include <iostream>
#include <string>

int main(int argc, char* argv[])
{
    _setmode(_fileno(stdout), _O_WTEXT);
    _setmode(_fileno(stdin), _O_WTEXT);

    std::wcout << L"ñeñeñe";
    std::wstring in;
    std::getline(std::wcin, in);
    std::wcout << in;

    return 0;
}

The solution depicted here goes in the right direction. One caveat: both stdin and stdout should be in the same mode, because the console echo rewrites the input. The remaining problem is having to write the string literals with \uXXXX escape codes.... I am still figuring out how to overcome that, perhaps using #define's to keep the text literals readable.

Raul Luna
  • Send wide strings to `wcout` (`L"你好"`). You don't have to use escape codes if you save the source in UTF-8 and set the compiler source charset to UTF-8 if it's not the default: `/utf-8` on the MSVC compiler. – Mark Tolonen Mar 16 '22 at 01:32