How must the compiler interpret a UTF-8 file that has non-ASCII characters inside these new types of string literals? I understand the standard does not specify file encodings, and that fact alone would seem to make the interpretation of non-ASCII characters inside source code completely undefined, making the feature just a tad less useful.
From n3290, 2.2 Phases of translation [lex.phases]
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. [Here's a bit about trigraphs.] Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)
There are a lot of Standard terms being used to describe how an implementation deals with encodings. Here's my attempt at a somewhat simpler, step-by-step description of what happens:
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set [...]
The issue of file encodings is handwaved; the Standard only cares about the basic source character set and leaves room for the implementation to get there.
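To see what that implementation-defined mapping means in practice (this is only an illustration of common compiler options, not anything the Standard requires), mainstream compilers let you state the source file encoding explicitly:

g++ -finput-charset=UTF-8 main.cpp     # GCC: interpret the source file as UTF-8
cl /source-charset:utf-8 main.cpp      # MSVC: likewise, on recent versions

Clang simply assumes UTF-8 source files.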
Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character.
The basic source character set is a simple list of allowed characters. It is not ASCII (see further down). Anything not in this list is 'transformed' (conceptually at least) to a \uXXXX form.
So no matter what kind of literal or file encoding is used, the source code is conceptually transformed into the basic source character set plus a bunch of \uXXXX escapes. I say conceptually because what implementations actually do is usually simpler, e.g. because they can deal with Unicode directly. The important part is that what the Standard calls an extended character (i.e. one not from the basic source set) should be indistinguishable in use from its equivalent \uXXXX form. Note that C++03 is available on e.g. EBCDIC platforms, so your reasoning in terms of ASCII is flawed from the get-go.
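As a quick sanity check of that equivalence (assuming your compiler accepts UTF-8 source files, which GCC, Clang and recent MSVC do), the extended character written directly and its universal-character-name spelling produce identical strings:

#include <cassert>
#include <string>

int main() {
    std::string raw     = u8"hôtel";        // extended character written directly in a UTF-8 source file
    std::string escaped = u8"h\u00F4tel";   // the same character spelled as a universal-character-name
    assert(raw == escaped);                 // both are the byte sequence 68 C3 B4 74 65 6C
}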
Finally, the process I described happens to (non-raw) string literals too. That means your code is equivalent to writing:
std::string    a = u8"L'h\u00F4tel de ville doit \u00EAtre l\u00E0-bas. \u00C7a c'est un fait!";
std::u16string b = u"L'h\u00F4tel de ville doit \u00EAtre l\u00E0-bas. \u00C7a c'est un fait!";
std::u32string c = U"L'h\u00F4tel de ville doit \u00EAtre l\u00E0-bas. \u00C7a c'est un fait!";
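The three prefixes then only differ in the execution-time encoding of those characters. A small sketch (not essential to the answer) makes the resulting code units visible:

#include <cassert>
#include <string>

int main() {
    std::string    a = u8"\u00E7";  // 'ç' in UTF-8:  two code units, 0xC3 0xA7
    std::u16string b = u"\u00E7";   // 'ç' in UTF-16: one code unit,  0x00E7
    std::u32string c = U"\u00E7";   // 'ç' in UTF-32: one code unit,  0x000000E7

    assert(a.size() == 2 && static_cast<unsigned char>(a[0]) == 0xC3
                         && static_cast<unsigned char>(a[1]) == 0xA7);
    assert(b.size() == 1 && b[0] == u'\u00E7');
    assert(c.size() == 1 && c[0] == U'\u00E7');
}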