This answer assumes you are working in MS Windows.
It's pretty sad that it's 2018 and this stuff still doesn't all work properly. But here is the state of things:
printf("\xE2\x98\xA0");
(which is the same as printf("%s", "\xE2\x98\xA0");) works because you are just outputting 3 bytes to the output stream. There is no Unicode or special character processing occurring in the C language. It is your terminal environment which looks for UTF-8 sequences in the output and chooses display glyphs accordingly.
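For instance, a minimal complete program (the variable name and the hex dump at the end are just for illustration) that outputs those same three bytes and then shows that C sees nothing but bytes:

#include <stdio.h>

int main(void)
{
    const char *s = "\xE2\x98\xA0";   /* UTF-8 encoding of U+2620 (skull and crossbones) */

    printf("%s\n", s);                /* the terminal decides how these bytes are displayed */

    /* Dump the bytes to show there is no hidden processing: prints E2 98 A0 */
    for (const char *p = s; *p; ++p)
        printf("%02X ", (unsigned char)*p);
    putchar('\n');
    return 0;
}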
Similarly, if you wrote the output to a file (using fprintf, or stream redirection) you would see that the file contains the bytes 0xE2, 0x98, 0xA0, and then you may choose to use a text file viewer that converts UTF-8 to display glyphs.
This part is all fine, and you can (and probably should) write your program to only ever write UTF-8 encoded characters to FILE streams.
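A small sketch of that round trip, writing the bytes and reading them back to show what the file really contains (the file name skull.txt is just an example):

#include <stdio.h>

int main(void)
{
    /* Write the UTF-8 bytes for U+2620 to a file. */
    FILE *fp = fopen("skull.txt", "w");
    if (!fp)
        return 1;
    fprintf(fp, "\xE2\x98\xA0");
    fclose(fp);

    /* Read it back as raw bytes to confirm what was stored: prints E2 98 A0 */
    fp = fopen("skull.txt", "rb");
    if (!fp)
        return 1;
    int c;
    while ((c = fgetc(fp)) != EOF)
        printf("%02X ", c);
    fclose(fp);
    return 0;
}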
The problem starts when we want to output wchar_t characters. In theory this should work:
printf("%ls", L"\u2620");
What is supposed to happen is that wcstombs is called to convert the Unicode code point sequence into a multibyte sequence. But which multibyte encoding should be used? UTF-8 has become ubiquitous now, but in the past there were also other encodings such as Shift-JIS, Big5, etc. You have to specify the multibyte encoding by calling setlocale, and the details of locales are implementation-defined.
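In theory, then, the whole sequence looks like the sketch below. Whether it actually emits UTF-8 depends entirely on which locale the implementation gives you; on an implementation whose default locale uses UTF-8 (typical on modern Linux) it prints the glyph.

#include <locale.h>
#include <stdio.h>

int main(void)
{
    /* Adopt the user's locale; the wcstombs-style conversion behind
       printf("%ls") then uses that locale's multibyte encoding. */
    setlocale(LC_ALL, "");
    printf("%ls\n", L"\u2620");
    return 0;
}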
Here's the kicker. There is no C locale supported by Windows for general UTF-8 output. If you try setlocale(LC_CTYPE, ".65001"), it just doesn't work.
You can output certain subsets of Unicode by using a supported locale. For example, the MSDN example using Japanese_Japan.932 works, outputting the Unicode input as Shift-JIS (not UTF-8).
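A sketch of both attempts, assuming the CRT reports the unsupported code page by returning NULL from setlocale (which is how it behaved when this was written); U+3042 is chosen here only because code page 932 can represent it:

#include <locale.h>
#include <stdio.h>

int main(void)
{
    /* Rejected: the CRT has no UTF-8 code-page locale. */
    if (setlocale(LC_CTYPE, ".65001") == NULL)
        printf("no UTF-8 locale available\n");

    /* Accepted: %ls output is then converted to Shift-JIS, not UTF-8. */
    if (setlocale(LC_CTYPE, "Japanese_Japan.932") != NULL)
        printf("%ls\n", L"\u3042");   /* U+3042 HIRAGANA LETTER A */

    return 0;
}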
What's worse is that the Windows API function WideCharToMultiByte does accept CP_UTF8 as its code page argument. You can use it to convert L"\u2620" to a char buffer and printf that, producing UTF-8 output. But of course you cannot "plug this in" to the FILE stream processing, which only calls wcstombs and never WideCharToMultiByte.
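For example, a sketch using that API (the buffer size here is arbitrary):

#include <stdio.h>
#include <windows.h>

int main(void)
{
    const wchar_t *ws = L"\u2620";
    char buf[16];                     /* arbitrary size, large enough for this string */

    /* Convert the wide string, including its terminator (-1), to UTF-8.
       For CP_UTF8 the last two arguments must be NULL. */
    if (WideCharToMultiByte(CP_UTF8, 0, ws, -1, buf, sizeof buf, NULL, NULL) == 0)
        return 1;

    printf("%s", buf);                /* writes the bytes E2 98 A0 */
    return 0;
}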
Why didn't they allow ".UTF-8" as a locale for wcstombs? Malicious behaviour? Who knows.
The next thing that should work in theory is:
FILE *fp = fopen("a.txt", "w");
fwide(fp, 1);
fwprintf(fp, L"\u2620");
However, in actuality the MS runtime doesn't do anything with fwide; it doesn't support wide-oriented streams. The Microsoft implementations of the wprintf family just output narrow characters, not wide characters, and they use the same wcstombs conversion that the narrow printf family does.
So that code doesn't work, and the fwprintf(fp, L"\u3603"); call from the Japanese .932 example outputs the multibyte (Shift-JIS) sequence instead of the raw wide character.
To write a UTF-16 file via the stdio.h API you actually have no choice but to use narrow characters and treat it like a binary file.
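A sketch of that workaround, assuming the Windows ABI where wchar_t is 16 bits and the in-memory representation is already UTF-16LE (the byte order mark and the file name are my choices):

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    const wchar_t *ws = L"\u2620";

    /* Binary mode: no newline translation, no code-page conversion. */
    FILE *fp = fopen("a.txt", "wb");
    if (!fp)
        return 1;

    /* UTF-16LE byte order mark, then the raw wchar_t bytes. */
    fwrite("\xFF\xFE", 1, 2, fp);
    fwrite(ws, sizeof(wchar_t), wcslen(ws), fp);

    fclose(fp);
    return 0;
}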