Unicode literal - how does this even make sense?

Question

#include <iostream>

int main() {
    std::cout << "\u2654" << std::endl;   // Result #1: ♔
    std::cout << U'\u2654' << std::endl;  // Result #2: 9812
    std::cout << U'♔' << std::endl;       // Result #3: 9812
    return 0;
}

I am having trouble understanding how Unicode works with C++. Why doesn't the character literal print as the character itself in the terminal?

I kind of want something like this to work:

char32_t txt_representation() { return /* Unicode codepoint */; }

Note: the source file is UTF-8 and so is the terminal. I am on macOS Sierra, using CLion.

What do you think is the type of `L'\u2654'` etc... (hint: it is not a `char`). And no `"\2654"` is *not* a `std::string` — Basile Starynkevitch, Dec 12 '16 at 18:17
Only the first one is (likely) `utf8`, the second 2 are either `utf32` or `ucs16` wide character encodings. To be sure with the first one use `u8"\u2654"`. — Galik, Dec 12 '16 at 18:19
@Galik: Oh you right! I will change. I am pretty confused about all of this encodings so please have some understanding.. — Entalpi, Dec 12 '16 at 18:20
If you use `L""` and `std::wstring` you need to convert to and from `utf8` (if your terminal is `utf8`). — Galik, Dec 12 '16 at 18:23
@Galik Formatted I/O is _supposed_ to do that for you automatically. — zwol, Dec 12 '16 at 18:45
Getting those Unicode characters presented in a console is highly system dependent. However, for the common desktop systems it's possible to do it via wide streams (e.g. `std::wcout`), with a separate translation unit providing the necessary system specific stream configuration. It's a little shameful: the shoemaker's children are the only ones lacking proper shoes. — Cheers and hth. - Alf, Dec 12 '16 at 19:07
Follow-up question which may be the question Entalpi wanted to ask: https://stackoverflow.com/questions/41107667/iostreams-print-wchar-t-or-charxx-t-value-as-a-character — zwol, Dec 12 '16 at 19:11

zwol · Answer 1 · 2016-12-12T19:19:58.160

C++ doesn't really have the concept of "character" in its type system. char, wchar_t, char16_t, and char32_t are all considered to be kinds of integer. As a consequence, character literals like 'x', L'x', U'x' are all numbers. There is an operator<< specifically for char, which is why

cout << "endl is almost never necessary" << '\n';

does the same thing as

cout << "endl is almost never necessary\n";

but narrow streams have no analogous overloads for wchar_t, char16_t, or char32_t, so your wide character literals are being silently converted to an integer type and printed as numbers. I personally never use iostreams and therefore I don't actually know how to persuade operator<< to print a number as its Unicode codepoint, but there's probably some way to do it.

There's a stronger distinction between "string" and "array of integers" in the type system, so you do get the output you expect when you supply a string literal. Note, however, that cout << L"♔" won't give the output you expect, and cout << "♔" isn't even guaranteed to compile. cout << u8"♔" will work on a C++11-compliant system where the narrow character encoding is in fact UTF-8, but will probably produce mojibake if the character encoding is something else.

(Yes, this is all much more complicated and less useful than it has any excuse for being. This is partially because of backward compatibility constraints inherited from C, partially because it was all designed back in the 1990s, before Unicode took over the world, and partially because many of the design errors in the C++ string and stream classes were not apparent as errors until it was too late to fix them.)

Do note that there is `std::wcout` which can handle wide character output (`std::wstring`/`wchar_t*`). — NathanOliver, Dec 12 '16 at 19:01
@NathanOliver Yes, but then you can't print narrow characters. — zwol, Dec 12 '16 at 19:09
@zwol: "you can't print narrow characters" is incorrect. However, the implementation of the limited functionality for that (only `char const*` is supported) may be sort of lacking, depending on the compiler. — Cheers and hth. - Alf, Dec 12 '16 at 19:10
How nice and subtle way of pointing out the overuse of `std::endl` in the original post and the solution at the same time :-) — The Vee, Dec 12 '16 at 21:42

score 2 · Answer 2 · answered Dec 12 '16 at 19:07

Printing wide characters to narrow streams is not supported and doesn't work at all. (It "works" but the result is not what you want).

Printing multibyte narrow strings to wide streams is not supported and doesn't work at all. (It "works" but the result is not what you want).

On a Unicode-ready system, std::cout << "\u2654" works as expected. So does std::cout << u8"\u2654". Most properly set up Unix-based operating systems are Unicode-ready.

On a Unicode-ready system, std::wcout << L'\u2654' should work as expected if you set up your program locale properly. This is done with this call:

 ::setlocale(LC_ALL, "");

or this

 ::std::locale::global(::std::locale(""));

Note "should"; with some compilers/libraries this method may not work at all. It's a deficiency with these compilers/libraries. I'm looking at you, libc++. It may or may not officially be a bug, but I view it as a bug.

You should really set up your locale in all programs that wish to work with Unicode, even if this doesn't appear necessary.

Mixing cout and wcout in the same program does not work and is not supported.

std::wcout << U'\u2654' does not work because this is mixing a wchar_t stream with a char32_t character. wchar_t and char32_t are different types. I guess a properly set up std::basic_ostream<char32_t> would work with char32_t strings, but the standard library doesn't provide any.

char32_t based strings are good for storing and processing Unicode code points. Do not use them for formatted input and output directly. std::wstring_convert can be used to convert them back and forth.

TL;DR: work with either std::streams and std::strings, or (if you are not on libc++) std::wstreams and std::wstrings.

Christophe · Accepted Answer · 2016-12-12T20:42:02.133

Unicode and C++

There are several Unicode encodings:

UTF-8 encodes each Unicode character into a sequence of one to four 8-bit bytes (char).
UTF-16 (which can be BE or LE depending on endianness) encodes each Unicode character into a sequence of one or two 16-bit words (char16_t).
UTF-32 (again BE or LE) encodes each Unicode character into one 32-bit word (char32_t).

Here is an excellent video tutorial on Unicode with C++ by James McNellis. He explains everything you need to know about character set encoding, about Unicode and its different encodings, and how to use it in C++.

Your code

"\u2654" is a narrow string literal, which has the type array of char. The white chess king Unicode character will be encoded as 3 consecutive chars corresponding to its UTF-8 encoding ({ 0xe2, 0x99, 0x94 }). As we are in a string, there is no problem with having several chars in it. As your console locale almost certainly uses UTF-8, it will correctly decode the sequence when the string is displayed.

U'\u2654' is a character literal of type char32_t (because of the uppercase U). As it is a char32_t (and not a char), it is not displayed as a char, but as an integer value. The value in decimal is 9812. Had you printed it in hex (0x2654), you would have recognized it immediately.

The last U'♔' obeys the same logic. Be aware, however, that you are embedding a Unicode character in the source code. This is fine as long as the editor's character encoding matches the source encoding expected by the compiler. But it could cause mismatches if the file were copied (without conversion) to an environment expecting a different encoding.

The last paragraph is not correct wrt. the Holy Standard, which requires a translation from source code encoding to Unicode. And down-conversion to whatever encoding is used for narrow string literals. Check out the phases of translation. It's not entirely clear for raw string literals, but that's a very fine detail. — Cheers and hth. - Alf, Dec 12 '16 at 19:51
@Cheersandhth.-Alf Thanks for highlighting my ambiguous formulation. You are completely right about the up and down mapping. But I wanted to draw attention on the case where source file encoding does not match compiler's expected input encoding, for example if the UTF-8 file would be copied to a non unicode environment and the implementation dependent mapping would interpret as a different local char that would be mapped to a different unicode codepoint - But ok, Win1252 is not so common anymore ;-) I've edited to clarify — Christophe, Dec 12 '16 at 20:54

score 1 · Answer 4 · answered Dec 12 '16 at 19:06

On my system I can't mix using std::cout with std::wcout and get sensible results. So you have to do these separately.

You should set the locale to that of the native system using std::locale::global(std::locale(""));.

Also, use wide streams for the last two outputs.

Either:

std::locale::global(std::locale(""));

std::cout << "\u2654" << std::endl;

Or:

std::locale::global(std::locale(""));

std::wcout << L"\u2654" << std::endl;
std::wcout << L'♔' << std::endl;

That should encourage the output streams to convert between the local system's encoding and either UTF-8 (1st example) or UTF-16/UTF-32 (2nd example).

I think to be safest with the first example (editors can have other encodings) it is best to prefix the string with u8:

std::cout << u8"\u2654" << std::endl;
