13

C++11 introduces a new set of string literal prefixes (and even allows user-defined suffixes). On top of this, you can use Unicode escape sequences directly to write a particular character without having to worry about its encoding.

const char16_t* s16 = u"\u00DA";
const char32_t* s32 = U"\u00DA";

But can I use Unicode escape sequences in wchar_t string literals as well? It would seem to be a defect if this weren't possible.

const wchar_t* sw = L"\u00DA";

The integer value of sw[0] would of course depend on what wchar_t is on a particular platform, but in all other respects this should be portable, no?
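For concreteness, a minimal check along these lines (the printed value is exactly the platform-dependent part):

#include <cstdio>

int main() {
    const wchar_t* sw = L"\u00DA";
    // Expected to print 0xDA on a UTF-16 or UTF-32 wchar_t platform; the
    // question is whether anything actually guarantees that.
    std::printf("sw[0] = 0x%lX, sizeof(wchar_t) = %zu\n",
                static_cast<unsigned long>(sw[0]), sizeof(wchar_t));
}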

hippietrail
  • 15,848
  • 18
  • 99
  • 158
rubenvb
  • 74,642
  • 33
  • 187
  • 332
  • I believe the value of `sw[0]` depends on what `wchar_t` is on a particular platform only to the extent of what the size of `wchar_t` is. I.e. `\u00DA` should always result in some Unicode encoding (UTF-8, UTF-16, UTF-32) of U+00DA, even when that's not the platform's normal encoding for that type. – bames53 Oct 17 '11 at 17:37
  • 1
    Actually the above is incorrect. The implementation is supposed to treat universal character names as it would the literal character. So if the implementation translates characters in a string literal to the execution character set then it should do so with UCNs as well. You're only guaranteed the UTF encoding if the UCN is inside a unicode literal (e.g., u8"\u00DA"). – bames53 Oct 19 '11 at 15:18

1 Answer

10

It would work, but it may not have the desired semantics. \u00DA will expand into as many target characters as are necessary for the UTF-8/16/32 encoding, depending on the size of wchar_t. But bear in mind that wide strings do not have any documented, guaranteed encoding semantics -- they're simply "the system's encoding", with no attempt made to say what that is, or to require the user to know what that is.
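To make that expansion concrete, here is a sketch of how the same escape comes out under each prefix (the element counts assume U+00DA specifically, which needs two bytes in UTF-8 but a single code unit in the other encodings):

const char     s8[]  = u8"\u00DA"; // 3 chars: 0xC3, 0x9A, 0   (UTF-8)
const char16_t s16[] =  u"\u00DA"; // 2 units: 0x00DA, 0       (UTF-16)
const char32_t s32[] =  U"\u00DA"; // 2 units: 0x000000DA, 0   (UTF-32)
const wchar_t  sw[]  =  L"\u00DA"; // whatever the wide execution encoding is

static_assert(sizeof(s8) == 3, "U+00DA takes two bytes in UTF-8");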

So it's best not to mix and match. Use either one, but not both, of the two (a sketch of keeping them apart follows the list):

  1. system-specific: char*/"", wchar_t*/L"", \x-literals, mbstowcs/wcstombs

  2. Unicode: char*/u8"", char16_t*/u"", char32_t*/U"", \u/\U literals.
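A rough sketch of what "not mixing" could look like in practice (the names are just illustrative; mbstowcs is the locale-dependent conversion mentioned in point 1):

#include <clocale>
#include <cstdlib>

int main() {
    std::setlocale(LC_ALL, "");              // opt in to the system's encoding

    // Family 1: system-specific. Stay within char/wchar_t and convert with
    // the locale-aware functions; make no assumption about what encoding
    // "" and L"" actually use.
    const char* narrow = "some system text";
    wchar_t wide[32];
    std::mbstowcs(wide, narrow, 32);

    // Family 2: Unicode. The literal prefix pins the encoding down exactly.
    const char*     u8str  = u8"\u00DA";     // UTF-8 (char8_t in C++20)
    const char16_t* u16str =  u"\u00DA";     // UTF-16
    const char32_t* u32str =  U"\u00DA";     // UTF-32

    (void)wide; (void)u8str; (void)u16str; (void)u32str;
}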

(Here are some related questions of mine on the subject.)

Kerrek SB
  • 464,522
  • 92
  • 875
  • 1,084
  • For the full details of the background of this question, [this libc++ test](http://llvm.org/svn/llvm-project/libcxx/trunk/test/localization/locale.categories/category.ctype/locale.ctype.byname/is_1.pass.cpp) is failing on Windows at the `\x00DA` line. I wonder if I could replace this with `\u00DA` and have it work for all `wchar_t`s that are large enough (i.e. 16- or 32-bit). – rubenvb Oct 03 '11 at 15:15
  • /u is for utf16, /U is for utf32, what is for utf8? (and I don't mean the string prefix, that's u8, I mean the hex prefix inside the string) – MarcusJ Apr 07 '18 at 11:42
  • 1
    You mean \, not /? Those are two different things. Also please note that I never said `\u` is for UTF-16. The escaped value is always an abstract codepoint (= number); it's just that `\U` takes a 32-bit number and `\u` takes a 16-bit number. I'm not really sure what a correct version of your question might be, perhaps something like an 8-bit-constrained codepoint reference, i.e. codepoints in the range [0, 256)? I guess that could exist, but it'd have extremely limited value, since most of those codepoints are readily available via ASCII and don't need escaping. – Kerrek SB Apr 07 '18 at 14:22
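A small sketch of the distinction described in that last comment (assuming a C++11 or later compiler; the emoji code point is just an arbitrary non-BMP example):

// Both escapes name an abstract code point; \u takes four hex digits,
// \U takes eight. The resulting encoding comes from the literal's prefix.
const char32_t a = U'\u00DA';         // U+00DA via the short escape
const char32_t b = U'\U000000DA';     // same code point via the long escape
static_assert(a == b, "same abstract code point, different spellings");

// A code point outside the BMP needs the long escape:
const char* smiley = u8"\U0001F600";  // UTF-8-encoded U+1F600 (char8_t in C++20)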