
A simple problem: I'm writing a chatroom program in C++ (but it's primarily C-style) for a class, and I'm trying to print, “#help — display a list of commands...” to the output window. While I could use two hyphens (--) to achieve roughly the same effect, I'd rather use an em-dash (—). printf(), however, doesn't seem to support printing em-dashes. Instead, the console just prints out the character, ù, in its place, despite the fact that entering em-dashes directly into the prompt works fine.

How do I get this simple Unicode character to show up?

Looking at Windows alt key codes, I find it interesting how alt+0151 is "—" and alt+151 is "ù". Is this related to my problem, or a simple coincidence?

  • The problem is the Windows console uses code page(s), not Unicode. – Richard Critten Sep 14 '17 at 19:16
  • The problem is that the em-dash is a Unicode character and you are trying to print it in an ASCII string – code1x1.de Sep 14 '17 at 19:17
  • Try this: std::wcout << wchar_t(0x2014); and read this thread: https://stackoverflow.com/questions/33029906/is-it-possible-to-cout-an-em-dash-on-linux-and-windows-c – code1x1.de Sep 14 '17 at 19:17
  • You need to use Unicode output (`WriteConsoleW`), or first convert the Unicode to multi-byte by using `WideCharToMultiByte(GetConsoleOutputCP(), ..)` for use in an *A* output function – RbMm Sep 14 '17 at 19:38
  • @RichardCritten - for the Windows console, the native encoding is exactly Unicode, and if you use Unicode output there are no problems at all. The code page is simply the current value used for performing the conversion from multi-byte to Unicode – RbMm Sep 14 '17 at 19:41
  • @RbMm a C statement `printf ("—\n");` run in my Windows console outputs `ÔÇö`. – Weather Vane Sep 14 '17 at 19:47
  • @WeatherVane - and so what? You need to use `WriteConsoleW` with `L"—\n"`. Do you understand why there is an error when you use the ANSI version? Because another code page (by default `CP_OEMCP`) is used to translate your string to Unicode (in your source, `CP_ACP` is used) – RbMm Sep 14 '17 at 19:57
  • @RbMm which is why I ticked up the first comment from Richard. – Weather Vane Sep 14 '17 at 20:03
  • @WeatherVane - the comment is wrong. Windows is a Unicode system and Unicode is used almost everywhere, in the console as well. The Windows console is Unicode. When you pass a Unicode string to print, it prints it as is, and L"—\n" is displayed correctly. When you use an ANSI function for output, the console first **translates** the multi-byte string to Unicode. The error is that your source code and the console use **different** code pages for that translation – RbMm Sep 14 '17 at 20:09
  • @RbMm I did not realise at first you were talking about [Windows console functions](https://learn.microsoft.com/en-us/windows/console/console-functions) and not Windows console. – Weather Vane Sep 14 '17 at 20:18
  • @WeatherVane - I am trying to say that for the Windows console, Unicode is native: all text is printed as Unicode only. When an *A* API version is called, all string data is first translated to Unicode and then the *W* API version is called. The error when using an *A* version (or the CRT shell) comes from the wrong code-page translation – RbMm Sep 14 '17 at 20:25
  • The answer linked by @sata300.de is the key for doing this conveniently in many cases, i.e. call `_setmode(_fileno(stdout), _O_U16TEXT)` at program startup and use wide-character C/C++ I/O such as `wprintf` and `std::wcout`. – Eryk Sun Sep 14 '17 at 21:18
  • The upvoted comment from @RichardCritten is probably just worded vaguely. I think it's referring to how the console (e.g. conhost.exe) is decoding the bytes written to it using its current output codepage (i.e. `GetConsoleOutputCP`). I don't think the comment means the console in general doesn't support Unicode. Though regarding the latter, the console is limited to the BMP (e.g. a surrogate code displays as a default character rather than decoding UTF-16 surrogate pairs); doesn't support combining codes; and requires a monospace font with glyphs for the characters (manual font linking helps). – Eryk Sun Sep 14 '17 at 21:26
  • @eryksun - `current output codepage` - this is an absolutely incorrect phrase. The console output is always in Unicode. `GetConsoleOutputCP` is the code page used to **translate** a multi-byte string to Unicode before displaying it – RbMm Sep 14 '17 at 22:54
  • @RbMm, maybe you're just misunderstanding what I wrote. I said "the console ... is decoding the bytes written to it using its current output codepage" (the latter is Microsoft's terminology). For example, `WriteFile` is called with a byte string. In Windows 8+ this calls `NtWriteFile` for the given File on the ConDrv device. The attached console (conhost.exe) is waiting on `NtDeviceIoControlFile`, which completes with the request to write the given bytes to the target screen buffer. The console first decodes these bytes using its "output codepage" by calling `MultiByteToWideChar` and the like. – Eryk Sun Sep 14 '17 at 23:09
  • @RbMm, if you don't like the term ["output codepage"](https://learn.microsoft.com/en-us/windows/console/getconsoleoutputcp), take it up with Microsoft. – Eryk Sun Sep 14 '17 at 23:11
  • @eryksun - but read further on this page - *A console uses its output code page to translate the character values written by the various output functions into the images displayed in the console window.* So it is used for **translation**, not for **output**; the output is always in Unicode – RbMm Sep 14 '17 at 23:14
  • @RbMm, I just used the name of the codepage that's returned by `GetConsoleOutputCP`, i.e. "output codepage". You're taking issue with the name as far as I can tell. Nothing I said is wrong about the operations. – Eryk Sun Sep 14 '17 at 23:16
  • @eryksun - in that case it is badly worded in MSDN; "translation code page" would be more correct. And all I am trying to explain is that all the errors are due to incorrect translations (two translations, Unicode->multibyte->Unicode, in most cases with different code pages). The only way to avoid this translation is to use `WriteConsoleW` – RbMm Sep 14 '17 at 23:19
  • @RbMm, the simple (but still non-portable) way is via `_setmode(_fileno(stdout), _O_U16TEXT)` and then use the wide-character CRT functions such as `wprintf`. It's not extremely efficient since the CRT ends up calling `_putwch_nolock` in a loop over the characters, and thus makes a `WriteConsoleW` call for each character. But this is interactive console I/O, so we don't need extreme speed and efficiency. – Eryk Sun Sep 14 '17 at 23:30
  • @eryksun - yes, with `_setmode(_fileno(stdout), _O_U16TEXT)`, `wprintf` begins to use `WriteConsoleW` (character by character) instead of `WriteFile`. But I personally do not understand why have all these problems with the CRT and/or ANSI output at all, when you can simply call `WriteConsoleW` and have no problems whatsoever – RbMm Sep 14 '17 at 23:50
  • @RbMm, it's easier when writing cross-platform code and adapting existing code. – Eryk Sun Sep 14 '17 at 23:53
  • @eryksun - for cross-platform code, maybe yes, though it is still not easy. If you write for Windows only, you need to use `WriteConsoleW` - and the main point: `printf` displays `—` as `-` anyway. Only `WriteConsoleW` gives the correct display – RbMm Sep 15 '17 at 00:01
  • You can also call `SetConsoleOutputCP(CP_UTF8)`. The `/utf-8` compiler option forces UTF-8 string literals. I wouldn't use this prior to Windows 8, in which case `WriteFile` to the console incorrectly returns the number of decoded characters written instead of the number of bytes written. Also, `SetConsoleCP(CP_UTF8)` is useless for non-ASCII input in all versions because the console makes the buggy assumption that it's encoding to ANSI (e.g. 1 byte per character) when it sizes the buffer for `WideCharToMultiByte`, which fails and yet `ReadFile` 'succeeds' at reading zero bytes, i.e. EOF. – Eryk Sun Sep 15 '17 at 00:22
  • Actually in Windows 10.0.15063 (Creators Update) reading input containing non-ASCII characters in `CP_UTF8` (65001) is a bit 'improved'. Apparently before encoding now they simply replace all non-ASCII characters with a Unicode NUL, so it doesn't look like EOF at least. It's just that all non-ASCII input characters end up as "\x00" in the buffer. – Eryk Sun Sep 15 '17 at 00:31

2 Answers


Windows is a Unicode (UTF-16) system, and the console is Unicode as well. If you want to print Unicode text, you need to use WriteConsoleW (and this is also the most efficient way):

#include <windows.h>
#include <wchar.h>

// write a wide (UTF-16) string directly to the console,
// bypassing any code-page translation
BOOL PrintString(PCWSTR psz)
{
    DWORD n;
    return WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), psz, (ULONG)wcslen(psz), &n, 0);
}

PrintString(L"—");

In this case the wide character (2 bytes, 0x2014) will be in your binary file, and the console prints it as is.

If an ANSI (multi-byte) function is used for console output - like WriteConsoleA or WriteFile - the console first translates the multi-byte string to Unicode via MultiByteToWideChar, with the value returned by GetConsoleOutputCP used as the CodePage parameter. Here (in the translation) there can be a problem if you use characters > 0x80.

First of all, the compiler can give you a warning: The file contains a character that cannot be represented in the current code page (number). Save the file in Unicode format to prevent data loss. (C4819). But even after you save the source file in Unicode format, the following can happen:

wprintf(L"ù"); // no warning
printf("ù"); //warning C4566

because L"ù" is saved as a wide-char string (as is) in the binary file - here everything is OK, no problems and no warning. But "ù" is saved as a char string (a single-byte string). The compiler needs to convert the wide string "ù" from the source file to a multi-byte string in the binary (the .obj file, from which the linker then creates the PE), and the compiler uses WideCharToMultiByte with CP_ACP (the current system default Windows ANSI code page) for this.

So what happens if you, say, call printf("ù");?

  1. The Unicode string "ù" is converted to multi-byte via WideCharToMultiByte(CP_ACP, ..). This happens at compile time, and the resulting multi-byte string is saved in the binary file.
  2. At run time, the console converts your multi-byte string to wide char via MultiByteToWideChar(GetConsoleOutputCP(), ..) and prints that string.

So you get 2 conversions: unicode -> CP_ACP -> multi-byte -> GetConsoleOutputCP() -> unicode

By default GetConsoleOutputCP() == CP_OEMCP != CP_ACP, even if you run the program on the same computer where you compiled it (and especially on another computer with another CP_OEMCP).

The problem is in the incompatible conversions - different code pages are used. But even if you change the console code page to your CP_ACP, the conversion can still translate some characters wrongly.

The situation with the CRT function wprintf is as follows:

wprintf first converts the given string from Unicode to multi-byte using its internal current locale (note that the CRT locale is independent of, and different from, the console locale), and then calls WriteFile with the multi-byte string. The console converts this multi-byte string back to Unicode:

unicode -> current_crt_locale -> multi-byte -> GetConsoleOutputCP() -> unicode

So to use wprintf, we first need to set the current CRT locale to GetConsoleOutputCP():

char sz[16];
// build a CRT locale name like ".850" from the console's output code page
sprintf(sz, ".%u", GetConsoleOutputCP());
setlocale(LC_ALL, sz);
wprintf(L"—");

But even so, here I see (on my computer) a - on screen instead of —. So the output will be -— if PrintString(L"—"); (which uses WriteConsoleW) is called just after this.

So the only reliable way to print any Unicode characters (supported by Windows) is to use the WriteConsoleW API.


After going through the comments, I've found eryksun's solution to be the simplest (...and the most comprehensible):

#include <stdio.h>
#include <io.h>
#include <fcntl.h>

int main()
{
    //other stuff
    _setmode(_fileno(stdout), _O_U16TEXT);
    wprintf(L"#help — display a list of commands...");
    return 0;
}

Portability isn't a concern of mine, and this solves my initial problem—no more ù—my beloved em-dash is on display.

I acknowledge this question is essentially a duplicate of the one linked by sata300.de, albeit with printf in place of cout, and unnecessary ramblings in the place of relevant information.