C++ Literals and Unicode

Question

C++ Literals

Environment:

OS: Windows 10 Pro;
Compiler: GCC latest.
IDE: Code::Blocks latest.
working on: Console applications.

My understanding is that the prefixes and suffixes on numeric literals determine the type of the value (though I am not sure). However, I am quite confused about the prefixes and suffixes on character and string literals. I have read a lot and spent days trying to understand them, but I ended up with more questions than answers, so I thought Stack Overflow could help.

Qs:

Q1: What is the correct use of the string prefixes u8, u, U and L?

I have the following code as example:

#include <iostream>
#include <string>
using namespace std;

int main()
{
    cout << "\n\n Hello World! (plain) \n";
    cout << u8"\n Hello World! (u8) \n";
    cout << u"\n Hello World! (u) \n";
    cout << U"\n Hello World! (U) \n";
    cout << L"\n Hello World! (plain) \n\n";

    cout << "\n\n\n";
}

The output is like this:

Hello World! (plain)

Hello World! (u8)

0x47f0580x47f0840x47f0d8

Q2: Why do the u, U and L literals produce this output? I expected the prefix only to determine the type, not to perform any encoding mapping (if that is what is happening).

Q3: Is there a simple, to-the-point reference on encodings such as UTF-8? I am confused about them, and I also doubt that console applications are able to deal with them. It seems crucial to understand them.

Q4: I would also appreciate a step-by-step reference that explains custom type (user-defined) literals.

*"Compiler: GCC latest."* - Please give the version number. Its entirely possible that between the time you made this post and my comment, a new version may have been released. Also take a look at http://en.cppreference.com/w/cpp/language/string_literal — WhiZTiM, Feb 20 '17 at 21:04
Generally best to ask one question per question. Multiple questions tend toward sprawling answers and make it harder for future users to find the one bit of information they are looking for. — user4581301, Feb 20 '17 at 21:14
For example, answering 1 requires a short discussion of character encodings, why `std::cout` seemed to handle UTF8, and `std::wcout` that would make an excellent stand-alone question. — user4581301, Feb 20 '17 at 21:18

score 3 · Accepted Answer · edited May 23 '17 at 12:18

First see: http://en.cppreference.com/w/cpp/language/string_literal

std::cout's operator<< is overloaded to print const char* as text. Before C++20 both the unprefixed literal and the u8 literal are arrays of const char, which is why the first two strings are printed.

cout << "\n\n Hello World! (plain) \n";
cout << u8"\n Hello World! (u8) \n";

As expected, prints¹:

Hello World! (plain)

Hello World! (u8)

Meanwhile, std::cout has no << overload that prints const char16_t*, const char32_t* or const wchar_t* as text, so these arguments match the << overload for printing pointers (const void*). That is why:

cout << u"\n Hello World! (u) \n";
cout << U"\n Hello World! (U) \n";
cout << L"\n Hello World! (plain) \n\n";

Prints:

0x47f0580x47f0840x47f0d8

As you can see, there are actually 3 pointer values printed there: 0x47f058, 0x47f084 and 0x47f0d8

However, you can get the last one to print properly by using std::wcout:

std::wcout << L"\n Hello World! (plain) \n\n";

This prints:

 Hello World! (plain)

¹ The u8 literal printed as expected because the first 128 code points of Unicode are encoded in UTF-8 as the same single bytes as in ASCII.

Worth pointing out that u8 prints due to the first few bits of utf8 being mapped to ascii. A more complex string would be filled with garbage — user4581301, Feb 20 '17 at 21:23
@user4581301 a more complex string will print correctly on a sane OS (aka not Windows). cout doesn't care, it's the console driver that has to interpret the multibyte output the program sent. — Cubbi, Mar 20 '17 at 13:28

score 2 · Answer 2 · answered Feb 20 '17 at 21:17

1) Narrow multibyte string literal. The type of an unprefixed string literal is const char[].

2) Wide string literal. The type of a L"..." string literal is const wchar_t[].

3) UTF-8 encoded string literal. The type of a u8"..." string literal is const char[] (const char8_t[] since C++20).

4) UTF-16 encoded string literal. The type of a u"..." string literal is const char16_t[].

5) UTF-32 encoded string literal. The type of a U"..." string literal is const char32_t[].

6) Raw string literal. Used to avoid escaping of any character; anything between the delimiters becomes part of the string. The prefix, if present, has the same meaning as described above.

std::cout can only print char-based strings as text. Given a pointer to any other character type, it falls back to the pointer overload and prints an address, which is how output such as 0x47f0580x47f0840x47f0d8 arises. To print a wchar_t literal, use std::wcout. There are no standard stream objects for char16_t or char32_t, so those literals have to be converted to a char (or wchar_t) string before they can be written to the console.

Raw string literals are very handy for formatting output. An example is R"~(This is the text that will be output just as I typed it into the code editor!)~", which is a narrow (char) string. If the R is combined with one of the encoding prefixes, as in LR"~(...)~", the raw string literal has the character type of that prefix. The cppreference page on string literals is a very comprehensive reference.
