This is my program:

#include <iostream>
#include <string>
#include <locale>
#include <clocale>
#include <codecvt>
#include <io.h>
#include <fcntl.h>

int main()
{
    fflush(stdout);
    _setmode(_fileno(stdout), _O_U16TEXT);
    std::ios_base::sync_with_stdio(false);
    std::setlocale(LC_ALL, "el_GR.utf8");
    std::locale loc{ "el_GR.utf8" };
    std::locale::global(loc);       // apparently this does not set the global locale
    //std::wcout.imbue(loc);
    //std::wcin.imbue(loc);

    std::wstring yes;
    std::wcout << L"It's all good γεια ναί" << L'\n';
    std::wcin >> yes;
    std::wcout << yes << L'\n';
    return 0;
}

Let's say I want to support Greek text for both input and output. This program works perfectly on Linux for various input and output languages, provided I set the appropriate encoding and, of course, remove the fflush(stdout) and _setmode() calls.

On Windows, this program outputs Greek (and English) correctly when I use std::locale::global(loc), but it will not accept Greek input typed from the keyboard. The std::wcout << yes statement prints gibberish or question marks if I type Greek. Apparently ::global isn't really global on Windows?

So I tried the .imbue() method on wcout and wcin (which also works on Linux), shown commented out in the program. When I enable either of these two statements, the program compiles properly and presents me with a prompt, but when I type anything and press Enter it simply exits with no errors.

I have tried a few Windows-specific calls, but they only confused me further. It is not clear to me what I should try on Windows, and when.

So the question is: how can I both input and output Greek text properly on Windows, as in the program above? I use MSVS 2017 with the latest updates. Thanks in advance.

KeyC0de
  • Edit on "present me with a prompt and when I press": it won't present me with a prompt. It won't present anything. The console just lets me type, and when I press Enter it simply exits. Note that I have also set "Use Unicode Character Set" in MSVS and I use the Lucida Console font. – KeyC0de Sep 16 '19 at 11:35
  • Streams default to ANSI mode, and for the console this uses its legacy codepage API, so reading Greek input depends on the console input codepage being 737 or 1253, which can be set via `SetConsoleCP`. I think recent versions of the CRT allow reading from the console via its wide-character UTF-16 API instead. Try `_setmode(_fileno(stdin), _O_U16TEXT)`. – Eryk Sun Sep 16 '19 at 11:37
  • @ErykSun YES!! IT WORKS! I didn't need `SetConsoleCP`, just the `_setmode(_fileno(stdin), _O_U16TEXT)` did it. However, when I `.imbue` on `std::wcout` and `std::wcin` it will output error: "Debug Assertion Failed! Program: ... File: minkernel\crts\ucrt\src\appcrt\stdio\fgetc.cpp line: 50 Expression: ((_Stream.is_string_backed())) || (fn = _fileno(_Stream.public_stream()), ((_textmode_safe(fn) == __crt_lowio_text_mode::ansi) && !_tm_unicode_safe(fn)))) ..` – KeyC0de Sep 16 '19 at 11:43
  • BTW, support for a locale such as "el_GR.utf8" is very new to the Windows CRT. Windows Vista added support for BCP-47 language-tag locales in the OS. Recently the CRT extended this to start allowing underscore instead of just hyphen in BCP-47 locales, plus a ".utf8" or ".utf-8" encoding (no other encoding is allowed). This isn't even [documented yet](https://learn.microsoft.com/en-us/cpp/c-runtime-library/locale-names-languages-and-country-region-strings?view=vs-2019). – Eryk Sun Sep 16 '19 at 12:02
  • @BarmakShemirani, I would detect a console handle via `GetConsoleMode` and default to `_O_U16TEXT` only for a console, else `_O_U8TEXT` (UTF-8). Note that both of these Unicode text modes require wide-character strings. Its UTF-8 text support is not like neutral UTF-8 "get a byte" support in most Unix systems. Internally the CRT transcodes to and from UTF-16. – Eryk Sun Sep 16 '19 at 12:07
  • @ErykSun Yes, for the Windows console we have to use `_O_U16TEXT`, but are you sure that UTF-8 would work for files in C++? Windows doesn't play well with `char`s from what I've tried. Linux does it perfectly though. (I still wonder why the `.imbue` method doesn't work.) – KeyC0de Sep 16 '19 at 12:09
  • @Nikos, we read and write `wchar_t` strings with `_O_U8TEXT` mode. The CRT transcodes between UTF-16LE and UTF-8 for I/O, but in memory we work with wide-character strings. – Eryk Sun Sep 16 '19 at 12:13
  • @ErykSun What do you mean `we`? You mean it's possible to do work with `_O_U8TEXT` and `wchar_t`s and non-english text? Because I just tried that and it doesn't output properly the non-english text. Btw you can add an answer and I will accept it. – KeyC0de Sep 16 '19 at 12:24
  • @Nikos, back up a few comments to where I discussed detecting the console via `GetConsoleMode` (success). If we have a console, then we have to use `_O_U16TEXT` because using UTF-8 with the console's legacy codepage API isn't generally supported. (Windows 8+ supports UTF-8 output via codepage 65001, but UTF-8 input is still broken for all versions.) – Eryk Sun Sep 16 '19 at 12:31
  • You can read/write user data in UTF-16, and read/write files in UTF-8 for compatibility. Use `WideCharToMultiByte(CP_UTF8,...)/MultiByteToWideChar` for UTF-16/UTF-8 conversion. – Barmak Shemirani Sep 16 '19 at 16:16
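
The console check Eryk Sun describes in the comments could be sketched roughly as follows. This is only an illustration under stated assumptions: `TextMode` and `choose_text_mode` are hypothetical names that do not appear in the thread, the non-Windows branch is a no-op added so the snippet compiles elsewhere, and the function is meant to be called before any I/O has happened on the stream.

```cpp
#include <cstdio>
#ifdef _WIN32
#   include <fcntl.h>
#   include <io.h>
#   include <windows.h>
#endif

// Hypothetical result type: which translation mode was applied to the stream.
enum class TextMode { Unchanged, Utf16, Utf8 };

// Hypothetical helper: pick _O_U16TEXT when the stream is attached to a
// console, otherwise _O_U8TEXT (files and pipes). Both modes expect
// wide-character I/O (wcout, wcin, wprintf, ...) afterwards.
TextMode choose_text_mode(std::FILE* stream)
{
#ifdef _WIN32
    const int fd = _fileno(stream);
    if (fd < 0)
        return TextMode::Unchanged;   // stream has no file descriptor

    // GetConsoleMode succeeds only for console handles.
    const HANDLE handle = reinterpret_cast<HANDLE>(_get_osfhandle(fd));
    DWORD console_mode = 0;
    const bool is_console = handle != INVALID_HANDLE_VALUE
                         && GetConsoleMode(handle, &console_mode) != 0;

    // _setmode returns -1 on failure.
    if (_setmode(fd, is_console ? _O_U16TEXT : _O_U8TEXT) == -1)
        return TextMode::Unchanged;
    return is_console ? TextMode::Utf16 : TextMode::Utf8;
#else
    (void)stream;                     // nothing to configure on other platforms
    return TextMode::Unchanged;
#endif
}
```

A program would typically call `choose_text_mode(stdin)`, `choose_text_mode(stdout)` and `choose_text_mode(stderr)` at the very top of `main`, so that redirected output ends up as UTF-8 while interactive console I/O goes through the UTF-16 path.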

1 Answer

As @Eryk Sun mentioned in the comments, I had to call _setmode(_fileno(stdin), _O_U16TEXT);

Windows UTF-8 console input is still (as of 2019) somewhat broken.

EDIT:

The above modification wasn't enough. I now do the following whenever I want to support the UTF-8 code page and Unicode input/output on Windows (read the code comments for more info).

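#include <cstdio>
#include <iostream>
#if defined _WIN32
#   include <windows.h>
#   include <io.h>
#   include <fcntl.h>
#endif

int main()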
{
    fflush( stdout );
#if defined _MSC_VER
#   pragma region WIN_UNICODE_SUPPORT_MAIN
#endif
#if defined _WIN32
    // change code page to UTF-8 UNICODE
    if ( !IsValidCodePage( CP_UTF8 ) )
    {
        return GetLastError();
    }
    if ( !SetConsoleCP( CP_UTF8 ) )
    {
        return GetLastError();
    }
    if ( !SetConsoleOutputCP( CP_UTF8 ) )
    {
        return GetLastError();
    }
    
    // change console font - post Windows Vista only
    HANDLE hStdOut = GetStdHandle( STD_OUTPUT_HANDLE );
    CONSOLE_FONT_INFOEX cfie;
    const auto sz = sizeof( CONSOLE_FONT_INFOEX );
    ZeroMemory( &cfie, sz );
    cfie.cbSize = sz;
    cfie.dwFontSize.Y = 14;
    wcscpy_s( cfie.FaceName,
        L"Lucida Console" );
    SetCurrentConsoleFontEx( hStdOut,
        false,
        &cfie );
        
    // change file stream translation mode
    _setmode( _fileno( stdout ), _O_U16TEXT );
    _setmode( _fileno( stderr ), _O_U16TEXT );
    _setmode( _fileno( stdin ), _O_U16TEXT );
#endif
#if defined _MSC_VER
#   pragma endregion
#endif
    std::ios_base::sync_with_stdio( false );
    // program:...

    return 0;
}

Guidelines:

  • Use "Use Unicode Character Set" in Project Properties -> General -> Character Set
  • Make sure you use a terminal font that supports Unicode (open a console -> Properties -> Font -> "Lucida Console" is ideal on Windows). The code above sets that automatically.
  • Use string and 8-bit chars.
  • Use 16-bit chars (wchar_t, wstring etc.) to interact with the Windows console
  • Use 8-bit chars/string at the application boundary (e.g. writing to files, interacting with other OSs etc.)
  • Convert string|char to wstring|wchar_t for interacting with the Windows APIs
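
The last two guidelines, together with the `WideCharToMultiByte`/`MultiByteToWideChar` suggestion from the question's comments, might look roughly like this. Treat it as a sketch: `to_utf8` and `from_utf8` are hypothetical helper names, and the non-Windows branch is a simplified stand-in (it assumes a 32-bit `wchar_t` and well-formed input) that exists only so the snippet compiles and behaves the same outside Windows.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#ifdef _WIN32
#   include <windows.h>
#endif

// Hypothetical helper: wide string (UTF-16 on Windows) -> UTF-8 bytes.
std::string to_utf8(const std::wstring& wide)
{
    if (wide.empty()) return {};
#ifdef _WIN32
    const int len = static_cast<int>(wide.size());
    // First call asks for the required size, second call converts.
    const int bytes = WideCharToMultiByte(CP_UTF8, 0, wide.data(), len,
                                          nullptr, 0, nullptr, nullptr);
    if (bytes <= 0) throw std::runtime_error("WideCharToMultiByte failed");
    std::string out(static_cast<std::size_t>(bytes), '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.data(), len, &out[0], bytes,
                        nullptr, nullptr);
    return out;
#else
    // Assumption: wchar_t holds one UTF-32 code point per element.
    std::string out;
    for (const wchar_t wc : wide) {
        const unsigned long cp = static_cast<unsigned long>(wc);
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
#endif
}

// Hypothetical helper: UTF-8 bytes -> wide string.
std::wstring from_utf8(const std::string& utf8)
{
    if (utf8.empty()) return {};
#ifdef _WIN32
    const int len = static_cast<int>(utf8.size());
    const int units = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), len,
                                          nullptr, 0);
    if (units <= 0) throw std::runtime_error("MultiByteToWideChar failed");
    std::wstring out(static_cast<std::size_t>(units), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), len, &out[0], units);
    return out;
#else
    // Assumption: input is well-formed UTF-8; no validation is performed.
    std::wstring out;
    for (std::size_t i = 0; i < utf8.size();) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        // Number of continuation bytes implied by the lead byte.
        const std::size_t extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1
                                : lead < 0xF0 ? 2 : 3;
        unsigned long cp = extra == 0 ? lead : (lead & (0x3F >> extra));
        for (std::size_t k = 1; k <= extra && i + k < utf8.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        out += static_cast<wchar_t>(cp);
        i += extra + 1;
    }
    return out;
#endif
}
```

With helpers like these, text read from `std::wcin` can be passed through `to_utf8` before being written to a file with a narrow `std::ofstream`, and UTF-8 file contents can be passed through `from_utf8` before being printed with `std::wcout`.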
KeyC0de
  • Is there any plan to fix this in the future? Still broken Nov 13 '20... – varvir Nov 13 '20 at 02:15
  • @varvir Windows 10 is heading towards a new console "[ecosystem](https://learn.microsoft.com/en-us/windows/console/ecosystem-roadmap)". Read more about it. I have updated my answer. – KeyC0de Nov 13 '20 at 12:00
  • I solved the UTF-8 input issue with just the following 2 lines: `_setmode( _fileno( stdout ), _O_U8TEXT ); _setmode( _fileno( stderr ), _O_U8TEXT );` Why is this not enough? Sorry, I have no knowledge of the Win32 API or the C runtime. And don't `SetConsoleCP` and `SetConsoleOutputCP` work? [ref1](https://stackoverflow.com/questions/22950412/c-cant-get-wcout-to-print-unicode-and-leave-cout-working) [ref2](https://stackoverflow.com/questions/45232484/c-crash-when-use-setmode-with-o-u8text-to-deal-with-unicode) [ref3](https://learn.microsoft.com/ko-kr/windows/console/setconsolecp) – varvir Nov 13 '20 at 14:24
  • @varvir `_setmode` isn't enough because it only sets the file stream's character translation mode. First, set the code page to Unicode UTF-8 (because by default it is an ANSI codepage variant named "OEM"). Second, apply a Windows console font that supports the additional glyphs; "Lucida Console" does the job. Finally, make sure your Project Properties -> `General` -> `Character Set` is set to `Use Unicode`. Use `wcout`, `wcin`, `wstring` and family. You can still use 8-bit chars to read from and write to files on the file system. Read the question's comments and the code comments. – KeyC0de Nov 13 '20 at 15:21