
This piece of code works when I compile it with mingw32 on Windows 10, and it emits the right result, as you can see below:

C:\prj\cd>bin\main.exe
1°à€3§4ç5@の,は,でした,象形字 ;

However, when I compile the same code with Visual Studio 2017, it emits the wrong characters:

/out:prova.exe
prova.obj

C:\prj\cd>prova.exe
1°à€3§4ç5@ã®,ã¯,ã§ã—ãŸ,象形字 ;

C:\prj\cd>

Here is the source code:

#include <windows.h>
#include <io.h>
#include <fcntl.h>
#include <stdio.h>
#include <string>
#include <iostream>

int main ( void )
{
    _wsetlocale(LC_ALL, L"it_IT.UTF-8");    // set the CRT locale
    _setmode(_fileno(stdout), _O_U8TEXT);   // put stdout in UTF-8 text mode
    SetConsoleCP(CP_UTF8);                  // console input code page
    SetConsoleOutputCP(CP_UTF8);            // console output code page

    // Enable buffering to prevent VS from chopping up UTF-8 byte sequences
    setvbuf(stdout, nullptr, _IOFBF, 1000);

    std::wstring test = L"1°à€3§4ç5@の,は,でした,象形字 ;";
    std::wcout << test << std::endl;

}

I have read several topics :

How to print UTF-8 strings to std::cout on Windows?

How to make std::wofstream write UTF-8?

and many others, but something still goes wrong... can you help me?

  • Why would you want to print using wcout or use wstrings when you are trying to use UTF-8 to begin with? You are mixing up two different encodings here. `std::string test = "1°à€3§4ç5@の,は,でした,象形字 ;";` and `cout` work just fine here with pretty much any modern editor/compiler; the _output_ (storage is ok!) is usually incorrect on *Windows* (which is UTF-16, sigh), but you seem to correct that issue with the first 5 lines. – ricco19 May 04 '18 at 17:12
  • Because in another lib I have to read a file, which returns several std::wstring. With mingw32 it works; it is with VS that something goes wrong... –  May 04 '18 at 17:21
  • I'm guessing that mingw32 must be doing some conversion in the background. But I do find it strange that Windows does almost everything in wide chars, but then doesn't want to use UTF-16 on the console. – wally May 04 '18 at 17:28
  • You have the console confused; you have told it you are going to be outputting UTF-8 and then you output UTF-16LE. If you want the console to correctly interpret a UTF-8 stream then you have to output a UTF-8 stream. `std::cout` is suitable for outputting UTF-8. – Richard Critten May 04 '18 at 17:49

1 Answer

The following works for me:

#include <string>
#include <iostream>
#include <Windows.h>

int main(void)
{
    // use utf8 literal
    std::string test = u8"1°à€3§4ç5@の,は,でした,象形字 ;"; 

    // set code page to utf8
    SetConsoleOutputCP(CP_UTF8);                        

    // Enable buffering to prevent VS from chopping up UTF-8 byte sequences
    setvbuf(stdout, nullptr, _IOFBF, 1000);

    // printing std::string to std::cout, not std::wstring to std::wcout
    std::cout << test << std::endl; 
}

But I had to change the console font to SimSun-ExtB (screenshot of the font settings).

Then all the characters are shown (screenshot of the console output).

wally
  • I don't think the [UTF-16 codepage](https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx) is available on the windows console. – wally May 04 '18 at 17:22
  • Codepage 65001 (`CP_UTF8`) is really wrong. It will misbehave very badly for non-ASCII output prior to Windows 8, and even in Windows 10, it does not support non-ASCII input. At best non-ASCII input characters are returned as NUL bytes, and at worst, prior to Windows 10, the whole read looks like EOF. – Eryk Sun May 04 '18 at 21:03
  • Of course the console wide-character API (e.g. `ReadConsoleW`, `WriteConsoleW`) is UTF-16LE, like the rest of Windows. However, internally, the console is UCS-2, i.e. surrogate pairs for characters beyond the BMP are rendered incorrectly as two default characters (e.g. two empty boxes). – Eryk Sun May 04 '18 at 21:04
  • The way this is handled properly in MSVC is to detect whether stdout is a console and, if so, switch the stream's mode to UTF-16 text mode, e.g. `_setmode(_fileno(stdout), _O_U16TEXT)`. Then transcode UTF-8 to UTF-16LE when writing to stdout. This is the only way that ensures the CRT will use the console's wide-character API. – Eryk Sun May 04 '18 at 21:08
  • Do you mean an example of the improper output prior to Windows 8? I don't have a screenshot handy, but here's an example of the incorrect output. Say you write `"αβγδεζηθι\n"` as a UTF-8 encoded string with the console output codepage set to 65001. Due to the console bug (the CRT isn't at fault), the CRT will write this string once and then several partial writes because it thinks the whole string wasn't written properly. The final output will be as if you wrote the following string: `"αβγδεζηθι\n\n�ηθι\n\n�\n\n"`. – Eryk Sun May 04 '18 at 21:20
  • @eryksun, I believe you. I meant an example of how to do it the 'right' way from `u8` literal to command prompt. I feel like there is more to it than just calling `_setmode(_fileno(stdout), _O_U16TEXT)`? Having to change the font also bothers me. – wally May 04 '18 at 21:23
  • You don't have a choice about using a TrueType font. The write can actually fail if you're using an OEM raster font, which is the default in Windows 7. Also, the console doesn't support DirectWrite automatic fallback fonts. This means while you can write any UTF-16LE text to a console that's configured with a TrueType font, only the font's implemented subset of BMP characters will render correctly. Unsupported characters will print as the font's default glyph, typically an empty box. – Eryk Sun May 04 '18 at 21:28
  • But the screen buffer itself is still UTF-16LE, and you can copy text from the console regardless of whether it's rendered correctly in the console. – Eryk Sun May 04 '18 at 21:30