
The following program compiles with both MSVC and MinGW. However, the MinGW build does not display Unicode correctly. Why, and how can I fix it?

Code:

#include <stdio.h>
#include <stdlib.h> /* system */
#include <windows.h>
#include <io.h>
#include <fcntl.h>

int wmain(void)
{
    _setmode(_fileno(stdout), _O_U16TEXT);
    _putws(L"哈哈哈");
    system("pause");

    return 0;
}

Mingw64 Compile Command:
i686-w64-mingw32-gcc -mconsole -municode play.c

MSVC Compiled:
[screenshot: console output of the MSVC build]

Mingw Compiled:
[screenshot: console output of the MinGW build]

Edit:
After some testing, the problem does not seem to be caused by MinGW. If I run the program directly by double-clicking the executable, the Unicode string is not displayed correctly either. The code page, however, is the same in both cases: 437.

It turns out the problem is related to the console font rather than the compiler. See the demo code in the answer below for changing the console font.
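
To confirm that the font differs between the two consoles, one could query the active console font. The following is a minimal sketch, not part of the original post: `current_console_font` is a hypothetical helper name, the `_WIN32_WINNT` define is an assumption for older headers that gate this API behind Vista or later, and the non-Windows branch exists only so the file still compiles elsewhere.

```c
#include <stdio.h>
#include <string.h>
#include <wchar.h>

#ifdef _WIN32
#ifndef _WIN32_WINNT
#define _WIN32_WINNT 0x0600 /* assumption: some older headers require Vista+ here */
#endif
#include <windows.h>
#endif

/* Hypothetical helper: copy the face name of the current console font into
   face (capacity cap wide characters, including the terminator).
   Returns 0 on success, -1 if stdout is not a console or the query fails. */
int current_console_font(wchar_t *face, size_t cap)
{
    if (cap == 0)
        return -1;
    face[0] = L'\0';

#ifdef _WIN32
    CONSOLE_FONT_INFOEX cfi;

    memset(&cfi, 0, sizeof cfi);
    cfi.cbSize = sizeof cfi; /* must be set before the call */

    if (!GetCurrentConsoleFontEx(GetStdHandle(STD_OUTPUT_HANDLE), FALSE, &cfi))
        return -1;

    wcsncpy(face, cfi.FaceName, cap - 1);
    face[cap - 1] = L'\0';
    return 0;
#else
    return -1; /* no Windows console available */
#endif
}
```

Calling this once in the Visual Studio console and once in a console opened by double-clicking the executable should show whether the two sessions use different fonts.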

  • Have you checked the codepage of your console? You can check the codepage by running ``chcp`` in your console. See also https://stackoverflow.com/questions/2492077/output-unicode-strings-in-windows-console-app – user1729210 Jan 02 '20 at 12:04
  • Later versions of MSVC (I think starting with Visual C++ 2010) and the newer Universal CRT (ucrt) have improved `_O_U16TEXT` mode for console output to use the console's wide-character API, and ucrt even supports the wide-character API for console input. These updates do not seem to be incorporated in msvcrt.dll, which tends to be more conservative because historically it has been the CRT of system libraries such as kernel32.dll. I think progress has been made to allow MinGW to link with ucrt instead of msvcrt.dll, but it's not the default configuration. – Eryk Sun Jan 02 '20 at 12:45
  • The wide-character API is your best option, even if it means you're forced to call `WriteConsoleW` and `ReadConsoleW` directly. Don't rely on the console's multibyte (i.e. code page) API if you need reliable support for Unicode, at least to the extent that the console supports it (i.e. no astral characters, composed characters, complex scripts, or font fallback). Support for UTF-8 (65001) is horribly buggy in Windows 7, and even in Windows 10 it's limited to output only. Setting the input codepage to UTF-8 is broken (well, it never worked) since non-ASCII characters are read as null bytes. – Eryk Sun Jan 02 '20 at 12:54
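
The comments above recommend calling `WriteConsoleW` directly. A minimal sketch of that approach follows, assuming a hypothetical helper name `console_write_wide` and a fallback to `fputws` when standard output is not a console (for example, when it is redirected to a file). The non-Windows branch is there only so the code compiles elsewhere.

```c
#include <stdio.h>
#include <wchar.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: write a wide string to standard output.
   Returns the number of wide characters written, or -1 on failure. */
long console_write_wide(const wchar_t *text)
{
    size_t len = wcslen(text);

#ifdef _WIN32
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode;

    /* GetConsoleMode succeeds only when the handle is a real console. */
    if (out != NULL && out != INVALID_HANDLE_VALUE && GetConsoleMode(out, &mode)) {
        DWORD written = 0;

        if (!WriteConsoleW(out, text, (DWORD)len, &written, NULL))
            return -1;
        return (long)written;
    }
#endif

    /* Redirected output or non-Windows build: let the C runtime encode it. */
    if (fputws(text, stdout) < 0)
        return -1;
    return (long)len;
}
```

With this, something like `console_write_wide(L"哈哈哈\n")` bypasses the code page translation when a console is attached. Whether the glyphs actually render still depends on the console font.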

2 Answers

This is happening because #define UNICODE and #define _UNICODE are missing. Try adding them alongside the other headers, before any #include. UNICODE selects the wide-character versions of Windows API functions, and _UNICODE is used with headers such as tchar.h to map generic-text routines such as _tprintf() and _tfopen() to their wide-character versions.

Please note: the -municode option is still required when linking if Unicode mode is used.
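
Since GCC documents -municode as predefining UNICODE, defining the macro again unconditionally can trigger a redefinition warning. A minimal sketch of guarded definitions is shown below; `unicode_mode_enabled` is a hypothetical helper added only to make the effect observable.

```c
/* Guarded definitions: skipped when the compiler driver (for example via
   -municode or -DUNICODE) has already defined the macro. */
#ifndef UNICODE
#define UNICODE
#endif
#ifndef _UNICODE
#define _UNICODE
#endif

#include <stdio.h>

/* Hypothetical helper: reports whether both macros are visible at this point. */
int unicode_mode_enabled(void)
{
#if defined(UNICODE) && defined(_UNICODE)
    return 1;
#else
    return 0;
#endif
}
```

The guards must come before any Windows or CRT header is included, because those headers read the macros at inclusion time.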

Deepak Yadav
  • It's not working either. The compiler complains about redefinition of the macro UNICODE. –  Jan 02 '20 at 13:53

After doing some research, it turns out that the default console font does not support Chinese glyphs. One can change the console font with the SetCurrentConsoleFontEx function.

Demo Code:

#ifdef _MSC_VER
#define _CRT_SECURE_NO_WARNINGS
#endif

#include <stdio.h>
#include <stdlib.h> /* system */
#include <wchar.h>  /* wcscpy */
#include <io.h>
#include <fcntl.h>
#include <windows.h>

#define FF_SIMHEI 54

int main(int argc, char const *argv[])
{
    CONSOLE_FONT_INFOEX cfi = {0};

    cfi.cbSize = sizeof(CONSOLE_FONT_INFOEX);
    cfi.nFont = 0;
    cfi.dwFontSize.X = 8;
    cfi.dwFontSize.Y = 16;
    cfi.FontFamily = FF_SIMHEI;
    cfi.FontWeight = FW_NORMAL;
    wcscpy(cfi.FaceName, L"SimHei");

    SetCurrentConsoleFontEx(GetStdHandle(STD_OUTPUT_HANDLE), FALSE, &cfi);

    /* UTF-8 String */
    SetConsoleOutputCP(CP_UTF8); /* Per Eryk Sun's comment: remove this line on Windows 7 or 8 */
    puts(u8"UTF-8你好");

    /* UTF-16 String */
    fflush(stdout); /* flush pending narrow output before switching the stream mode */
    _setmode(_fileno(stdout), _O_U16TEXT);
    _putws(L"UTF-16你好");

    system("pause");

    return 0;
}
  • Can you clarify *exactly* the circumstances that led to the different output result between the VC++ and MinGW builds? And remove the `SetConsoleOutputCP(CP_UTF8)` because some versions of the CRT `_putws` function may simply encode to the console codepage, and codepage 65001 is broken for output in Windows 7. – Eryk Sun Jan 03 '20 at 02:02
  • @ErykSun Sure, the font configuration for the Visual Studio console and the default console was different. When I ran the program built by MinGW, it ran in the default Windows console, which uses the Consolas font by default. –  Jan 03 '20 at 05:39
  • Have you tested without `SetConsoleOutputCP(CP_UTF8)` in Windows 7 and 8? The old way the CRT wrote Unicode to the console was with a double translation from UTF-16 to the locale codepage and from the locale codepage to the active console codepage. Starting around VC++ 2010 they introduced support for `_O_U16TEXT` mode with console files that skipped double translation to directly write UTF-16 to the console via `WriteConsoleW`. That's supported in ucrt -- and maybe in msvcrt.dll in Windows 10, but probably not in older versions of msvcrt.dll in Windows 7 and 8. – Eryk Sun Jan 03 '20 at 09:43
  • @ErykSun No, however, ```SetConsoleOutputCP(CP_UTF8)``` is necessary for windows 10. It will output garbage otherwise. Since windows 7 is no longer supported, I decide not to remove it. I will add a note instead. –  Jan 04 '20 at 07:07
  • It's not necessary, and it's especially problematic for interacting with console applications that assume the console is using the default OEM codepage, since the input and output codepages are *global* settings for all applications attached to a console. A Unicode application should be independent of the legacy codepage setting. It should use the console's wide-character API. This also allows reading Unicode input as well, whereas setting the input codepage to UTF-8 via `SetConsoleCP(CP_UTF8)` is still broken even in Windows 10; non-ASCII characters are read as null bytes. – Eryk Sun Jan 04 '20 at 08:26
  • MinGW can link with the Universal CRT (ucrt, i.e. the API sets associated with ucrtbase.dll) instead of msvcrt.dll. ucrt supports the console's wide-character API for both input and output when the file descriptor is set to `_O_U16TEXT` mode. For this, your program has to use wide-character strings only, however. So if you use UTF-8, you'll have to decode to UTF-16 before calling CRT functions. – Eryk Sun Jan 04 '20 at 08:30
  • @ErykSun I don't know what your testing environment is. But if I don't add ```SetConsoleCP(CP_UTF8)```, the console will print garbage. And yes, Windows uses UTF-16LE as its default Unicode encoding scheme. But you can always use the u8 prefix to force the compiler to store a UTF-8 string instead of converting to wide characters. –  Jan 07 '20 at 06:15
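
The comments above mention decoding UTF-8 to UTF-16 before calling wide-character CRT functions. On Windows, `MultiByteToWideChar` with `CP_UTF8` performs this conversion. The portable sketch below only illustrates what the conversion involves: `utf8_to_utf16` is a hypothetical helper, and it deliberately omits checks for overlong encodings and encoded surrogates that a production decoder would need.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode a NUL-terminated UTF-8 string into UTF-16
   code units. out has room for cap units, including the terminator.
   Returns the number of units written (excluding the terminator),
   or -1 on malformed input or insufficient space. */
long utf8_to_utf16(const char *src, uint16_t *out, size_t cap)
{
    const unsigned char *p = (const unsigned char *)src;
    size_t n = 0;

    if (cap == 0)
        return -1;

    while (*p) {
        uint32_t cp;
        int extra;

        /* The lead byte determines how many continuation bytes follow. */
        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return -1; /* stray continuation byte or invalid lead byte */
        p++;

        while (extra-- > 0) {
            if ((*p & 0xC0) != 0x80)
                return -1; /* truncated or malformed sequence */
            cp = (cp << 6) | (uint32_t)(*p++ & 0x3F);
        }

        if (cp > 0x10FFFF)
            return -1;

        if (cp >= 0x10000) {
            /* Outside the BMP: encode as a surrogate pair. */
            if (n + 2 >= cap)
                return -1;
            cp -= 0x10000;
            out[n++] = (uint16_t)(0xD800 | (cp >> 10));
            out[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        } else {
            if (n + 1 >= cap)
                return -1;
            out[n++] = (uint16_t)cp;
        }
    }

    out[n] = 0;
    return (long)n;
}
```

On Windows, where `wchar_t` is a 16-bit type, the resulting buffer could be passed to a wide-character function such as `_putws` after a cast. That pairing is an assumption of this sketch rather than something shown in the thread.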