
I am reading the well-known answer about string and wstring and have run into some confusion.

Source charset and execution charset are both set to UTF-8. Environment: Windows x64, VC++ compiler, Git Bash console (which can print Unicode characters), system default code page 936 (GB2312).

My experiment code:

#include <cstring>
#include <iostream>
using namespace std;
int main(int argc, char* argv[])
{
    wchar_t c[] = L"olé";
    wchar_t d[] = L"abc";
    wcout << c << endl;
    wcout << d << endl;

    return 0;
}

It can print "abc" but can't print "é".

I understand that wchar_t is used together with L-prefixed string literals, and that under Windows wchar_t is encoded as UTF-16. (That's hard-coded, right? No matter which source charset or execution charset I choose, L"abc" would always have the same UTF-16 code units.)

The question is: how can wcout print a UTF-16 encoded string ("abc") while my source file is UTF-8 and the execution charset is UTF-8? I would expect the program not to recognize UTF-16 encoded data unless I set everything to UTF-16.

And if it can print UTF-16 in some way, then why can't it print é?

Rick
  • I haven't tested this myself because I can't compile C++ on my box these days (shame on me). But I find your question super interesting, so I googled a bit. I think M.M is right and this is related to the console's capabilities. Look for the Windows-specific _setmode(_fileno(stdout), _O_U16TEXT);... My wild guess is that wcout doesn't forward the wchar_t* as-is but does some translation, which is why you see some chars printed correctly and not the é. – rodix May 31 '18 at 04:53
  • Yes, I know the Windows console does not support UTF-8. So I tried it in Git Bash (I think it supports UTF-8). Doesn't the Windows console lack support for UTF-x altogether? – Rick May 31 '18 at 04:57
  • The Windows console seems to support UTF-8 chars; you can paste Russian chars into cmd. I think wcout can print "abc" easily because in ASCII, UTF-8 and UTF-16, "a" is coded as 0x61 in the last byte. Pretty much the same for "b" and "c". But "é" in UTF-16 is 0xe9 at the end, and no UTF-8 char ends with 0xe9. I guess there's no legacy or trivial translation between "é" in UTF-16 and "é" in UTF-8. You need to prepare your console for your app's stdout if you are planning to send UTF-16 (and use non-legacy chars) – rodix May 31 '18 at 05:31

2 Answers


You need a non-standard, Windows-specific call to enable UTF-16 output.

#include <iostream>
#include <io.h>
#include <fcntl.h>
#include <stdio.h>

int main()
{
    _setmode(_fileno(stdout), _O_U16TEXT); // <=== Windows madness
    std::wcout << L"olé\n";
}

Note you cannot use cout after doing this, only wcout.

Also note that your source file must have a BOM, otherwise the compiler will not recognise it as Unicode (unless you tell it the encoding explicitly, e.g. with /utf-8 or /source-charset:utf-8).
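A minimal sketch of the same idea that avoids depending on the source file encoding at all: the é is written as the universal character name \u00e9, so the file stays pure ASCII. The helper names enable_utf16_stdout, greeting and print_greeting are made up for illustration, and the #ifdef guard is only there so the snippet still compiles on non-Windows compilers, where it does nothing useful.

```cpp
#include <iostream>
#include <string>
#ifdef _WIN32
#include <fcntl.h>
#include <io.h>
#include <stdio.h>
#endif

// Hypothetical helper: switch stdout to UTF-16 text mode.
// Returns true only on Windows and only if _setmode succeeded
// (_setmode returns -1 on failure).
bool enable_utf16_stdout()
{
#ifdef _WIN32
    return _setmode(_fileno(stdout), _O_U16TEXT) != -1;
#else
    return false; // no equivalent call here; this sketch does nothing
#endif
}

// Hypothetical helper: the test string, with é spelled as \u00e9 so the
// source file contains only ASCII and needs neither a BOM nor /utf-8.
std::wstring greeting()
{
    return L"ol\u00e9\n";
}

// Hypothetical helper: set the mode once, then use only wide output.
void print_greeting()
{
    enable_utf16_stdout();
    std::wcout << greeting();
}
```

Calling print_greeting() from main should behave like the program above; the restriction on using cout afterwards still applies.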

n. m. could be an AI

The Windows Console does not support UTF-16 output. It only supports 8-bit output, and it has partial support for 8-bit MBCS, such as Big5 or UTF-8.

To display Unicode characters on the console you will need to do conversion to UTF-8 or another MBCS in your code, and also put the console into UTF-8 mode (which requires an undocumented system call).

See also this answer
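A rough sketch of the "convert to UTF-8 yourself, then switch the console to UTF-8" approach described above. The names utf16_to_utf8 and enable_utf8_console are hypothetical helpers, and the conversion is hand-rolled only to keep the snippet self-contained. Using SetConsoleOutputCP(CP_UTF8) to switch the console is my assumption; it may not be the call this answer has in mind.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: encode UTF-16 code units as UTF-8 bytes.
// Unpaired surrogates are replaced with U+FFFD.
std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Valid surrogate pair: combine into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD; // lone surrogate
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: ask the console to interpret output bytes as UTF-8.
bool enable_utf8_console()
{
#ifdef _WIN32
    return SetConsoleOutputCP(CP_UTF8) != 0; // assumption, see text above
#else
    return true; // assumption: nothing to switch outside Windows
#endif
}
```

With these in place, a program would call enable_utf8_console() once and then write utf16_to_utf8(u"ol\u00e9") through plain std::cout. Whether the glyph actually shows up still depends on the console and its font.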

M.M
  • Then why should I use `wchar_t` instead of setting everything under utf-8? I mean, it seems useless. – Rick May 31 '18 at 04:38
  • @Rick Well that is up to you. You might want to use it for interacting with the UTF-16 version of Windows API, and using other output methods besides the console (e.g. Windows API calls that write to files or windows) – M.M May 31 '18 at 04:39
  • But how can `L"abc"` be printed? They are encoded with UTF-16 too. – Rick May 31 '18 at 04:41
  • I think it outputs `NUL 'a' NUL 'b' NUL 'c'` and the console ignores the nulls – M.M May 31 '18 at 04:42
  • OK, then it kind of makes sense. Thanks! By the way, is there any way to observe that `NUL 'a' NUL 'b' NUL 'c'`? – Rick May 31 '18 at 04:44
  • Wait, the Windows console supports UTF-8? I think that's not true. – Rick May 31 '18 at 05:01
  • @Rick: It supports it, sort of, [via code pages](https://superuser.com/q/269818/556135) for any program using normal file I/O APIs. If the program in question knows how to use [the console I/O API](https://learn.microsoft.com/en-us/windows/console/high-level-console-i-o) (separate from the regular file I/O API), that API provides full support for Unicode using UTF-16, but that API isn't as well known, and requires the program to support it explicitly using Windows specific calls (and dynamically choosing to use file or console APIs by detecting handle redirection). – ShadowRanger May 31 '18 at 05:10
  • @ShadowRanger oh, that hack thing. Thanks for reminding me about that. – Rick May 31 '18 at 05:12
  • You also may need to do `_setmode` with `_O_UTF8` – M.M May 31 '18 at 05:50
  • @M.M Hey, I would like to ask again: even if a console supported UTF-16, I couldn't get the correct output, right? Because I set everything to UTF-8. – Rick May 31 '18 at 05:54
  • @Rick the Windows console doesn't support UTF-16 – M.M May 31 '18 at 05:56
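
On the question in the comments about observing what is inside `L"abc"`: a small sketch that dumps the code units of a wide string as hex. The helper name code_units_hex is made up for illustration. This shows what is stored in memory, not the bytes wcout actually writes, so it cannot by itself confirm the `NUL 'a' NUL 'b' NUL 'c'` guess.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: format each wchar_t code unit as at least four
// hex digits, separated by spaces.
std::string code_units_hex(const std::wstring& s)
{
    std::string out;
    char buf[16];
    for (std::size_t i = 0; i < s.size(); ++i) {
        std::snprintf(buf, sizeof buf, "%04x", static_cast<unsigned>(s[i]));
        if (i != 0) {
            out += ' ';
        }
        out += buf;
    }
    return out;
}
```

For example, code_units_hex(L"abc") yields "0061 0062 0063" and code_units_hex(L"ol\u00e9") yields "006f 006c 00e9". The values come from the literal itself and do not change with the source or execution charset.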