How must the compiler interpret a UTF-8 file that has non-ASCII characters inside these new types of string literals? I understand the standard does not specify file encodings, and that fact alone would seem to make the interpretation of non-ASCII characters inside source code completely undefined, making the feature just a tad less useful.
From n3290, 2.2 Phases of translation [lex.phases]
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. [Here's a bit about trigraphs.] Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)
There are a lot of Standard terms being used to describe how an implementation deals with encodings. Here's my attempt at a somewhat simpler, step-by-step description of what happens:
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set [...]
The issue of file encodings is handwaved; the Standard only cares about the basic source character set and leaves room for the implementation to get there.
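To see what that implementation-defined mapping means in practice (this is only an illustration of common compiler options, not anything the Standard requires), mainstream compilers let you state the source file encoding explicitly:

g++ -finput-charset=UTF-8 main.cpp     # GCC: interpret the source file as UTF-8
cl /source-charset:utf-8 main.cpp      # MSVC: likewise, on recent versions

Clang simply assumes UTF-8 source files.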
Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character.
The basic source character set is a simple list of allowed characters. It is not ASCII (see further down). Anything not in this list is 'transformed' (conceptually at least) to a \uXXXX form.
So no matter what kind of literal or file encoding is used, the source code is conceptually transformed into the basic source character set plus a bunch of \uXXXX escapes. I say conceptually because what implementations actually do is usually simpler, e.g. because they can deal with Unicode directly. The important part is that what the Standard calls an extended character (i.e. one not from the basic source set) should be indistinguishable in use from its equivalent \uXXXX form. Note that C++03 is available on e.g. EBCDIC platforms, so your reasoning in terms of ASCII is flawed from the get-go.
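As a quick sanity check of that equivalence (assuming your compiler accepts UTF-8 source files, which GCC, Clang and recent MSVC do), the extended character written directly and its universal-character-name spelling produce identical strings:

#include <cassert>
#include <string>

int main() {
    std::string raw     = u8"hôtel";        // extended character written directly in a UTF-8 source file
    std::string escaped = u8"h\u00F4tel";   // the same character spelled as a universal-character-name
    assert(raw == escaped);                 // both are the byte sequence 68 C3 B4 74 65 6C
}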
Finally, the process I described happens to (non-raw) string literals too. That means your code is equivalent to writing:
std::string    a = u8"L'h\u00F4tel de ville doit \u00EAtre l\u00E0-bas. \u00C7a c'est un fait!";
std::u16string b = u"L'h\u00F4tel de ville doit \u00EAtre l\u00E0-bas. \u00C7a c'est un fait!";
std::u32string c = U"L'h\u00F4tel de ville doit \u00EAtre l\u00E0-bas. \u00C7a c'est un fait!";
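The three prefixes then only differ in the execution-time encoding of those characters. A small sketch (not essential to the answer) makes the resulting code units visible:

#include <cassert>
#include <string>

int main() {
    std::string    a = u8"\u00E7";  // 'ç' in UTF-8:  two code units, 0xC3 0xA7
    std::u16string b = u"\u00E7";   // 'ç' in UTF-16: one code unit,  0x00E7
    std::u32string c = U"\u00E7";   // 'ç' in UTF-32: one code unit,  0x000000E7

    assert(a.size() == 2 && static_cast<unsigned char>(a[0]) == 0xC3
                         && static_cast<unsigned char>(a[1]) == 0xA7);
    assert(b.size() == 1 && b[0] == u'\u00E7');
    assert(c.size() == 1 && c[0] == U'\u00E7');
}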