What will happen if I omit the u8 prefix for string literals that contain universal character names?
So instead of:
u8"\u00a7some-text"
I write this:
"\u00a7some-text"
Without the u8 prefix, the string will be encoded in the execution character set of your platform. The execution character set may be UTF-8 (which is the default on several platforms), but cannot be assumed to always be UTF-8 (see this answer).
If the execution character set cannot encode a universal character name (or any other value in the string literal), the result is implementation-defined: the compiler might, for example, reject the code or substitute some sentinel value. For example, consider the code:
const char* c = "\u00a7";
When compiled using GCC 5.3 with -fexec-charset=ascii, it fails with the error:
error: converting UCN to execution character set: Invalid or incomplete multibyte or wide character
This is because U+00A7 cannot be encoded in ASCII. However, using the u8 prefix:
const char* c = u8"\u00A7";
Compilation succeeds, and c points to the bytes 0xC2 0xA7 0x00 (the two-byte UTF-8 encoding of U+00A7, followed by the terminating null).
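To check the byte sequence of a u8 literal yourself, something like the following minimal sketch could be used. The helper name literal_bytes is hypothetical and not part of any library. It is templated on the element type because, from C++20 onward, u8 literals have element type char8_t rather than char (so the const char* initialization shown earlier only compiles up to C++17).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not a standard function): copies every element of a
// string literal, including the terminating null, into a vector of bytes.
// CharT is deduced as char in C++11/14/17 and as char8_t in C++20.
template <typename CharT, std::size_t N>
std::vector<unsigned char> literal_bytes(const CharT (&lit)[N]) {
    std::vector<unsigned char> out;
    out.reserve(N);
    for (std::size_t i = 0; i < N; ++i) {
        // Cast through unsigned char so a signed char does not yield negative values.
        out.push_back(static_cast<unsigned char>(lit[i]));
    }
    return out;
}
```

Calling literal_bytes(u8"\u00A7") should give the three bytes 0xC2, 0xA7, 0x00 whatever -fexec-charset is set to. Calling it on the unprefixed literal "\u00A7" gives whatever the execution character set produces, which is exactly the platform dependence described above.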
If you use the u8 prefix, your string is guaranteed to be UTF-8 encoded, regardless of the platform's configuration.