13

C++11 introduces a new set of string literal prefixes (and even allows user-defined suffixes). On top of this, you can use Unicode escape sequences directly to write a particular character without having to worry about its encoding.

const char16_t* s16 = u"\u00DA";
const char32_t* s32 = U"\u00DA";

But can I use Unicode escape sequences in wchar_t string literals as well? It would seem to be a defect if this weren't possible.

const wchar_t* sw = L"\u00DA";

The integer value of sw[0] would of course depend on what wchar_t is on a particular platform, but in all other respects this should be portable, no?
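For concreteness, a minimal check along these lines (the printed value is exactly the platform-dependent part):

#include <cstdio>

int main() {
    const wchar_t* sw = L"\u00DA";
    // Expected to print 0xDA on a UTF-16 or UTF-32 wchar_t platform; the
    // question is whether anything actually guarantees that.
    std::printf("sw[0] = 0x%lX, sizeof(wchar_t) = %zu\n",
                static_cast<unsigned long>(sw[0]), sizeof(wchar_t));
}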

hippietrail
  • 15,848
  • 18
  • 99
  • 158
rubenvb
  • 74,642
  • 33
  • 187
  • 332
  • I believe the value of `sw[0]` depends on what `wchar_t` is on a particular platform only to the extent of what the size of `wchar_t` is. I.e. `\u00DA` should always result in some Unicode encoding (UTF-8, UTF-16, UTF-32) of U+00DA, even when that's not the platform's normal encoding for that type. – bames53 Oct 17 '11 at 17:37
  • 1
    Actually the above is incorrect. The implementation is supposed to treat universal character names as it would the literal character. So if the implementation translates characters in a string literal to the execution character set then it should do so with UCNs as well. You're only guaranteed the UTF encoding if the UCN is inside a unicode literal (e.g., u8"\u00DA"). – bames53 Oct 19 '11 at 15:18

1 Answer

10

It would work, but it may not have the desired semantics. \u00DA will expand into as many target characters as are necessary for the UTF-8/16/32 encoding, depending on the size of wchar_t. But bear in mind that wide strings do not have any documented, guaranteed encoding semantics -- they're simply "the system's encoding", with no attempt made to say what that is, or to require the user to know what that is.
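To make that expansion concrete, here is a sketch of how the same escape comes out under each prefix (the element counts assume U+00DA specifically, which needs two bytes in UTF-8 but a single code unit in the other encodings):

const char     s8[]  = u8"\u00DA"; // 3 chars: 0xC3, 0x9A, 0   (UTF-8)
const char16_t s16[] =  u"\u00DA"; // 2 units: 0x00DA, 0       (UTF-16)
const char32_t s32[] =  U"\u00DA"; // 2 units: 0x000000DA, 0   (UTF-32)
const wchar_t  sw[]  =  L"\u00DA"; // whatever the wide execution encoding is

static_assert(sizeof(s8) == 3, "U+00DA takes two bytes in UTF-8");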

So it's best not to mix and match. Use either one, but not both, of the two (a sketch of keeping them apart follows the list):

  1. system-specific: char*/"", wchar_t*/L"", \x-literals, mbstowcs/wcstombs

  2. Unicode: char*/u8"", char16_t*/u"", char32_t*/U"", \u/\U literals.
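A rough sketch of what "not mixing" could look like in practice (the names are just illustrative; mbstowcs is the locale-dependent conversion mentioned in point 1):

#include <clocale>
#include <cstdlib>

int main() {
    std::setlocale(LC_ALL, "");              // opt in to the system's encoding

    // Family 1: system-specific. Stay within char/wchar_t and convert with
    // the locale-aware functions; make no assumption about what encoding
    // "" and L"" actually use.
    const char* narrow = "some system text";
    wchar_t wide[32];
    std::mbstowcs(wide, narrow, 32);

    // Family 2: Unicode. The literal prefix pins the encoding down exactly.
    const char*     u8str  = u8"\u00DA";     // UTF-8 (char8_t in C++20)
    const char16_t* u16str =  u"\u00DA";     // UTF-16
    const char32_t* u32str =  U"\u00DA";     // UTF-32

    (void)wide; (void)u8str; (void)u16str; (void)u32str;
}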

(Here are some related questions of mine on the subject.)

Kerrek SB
  • 464,522
  • 92
  • 875
  • 1,084
  • For the full details of the background of this question, [this libc++ test](http://llvm.org/svn/llvm-project/libcxx/trunk/test/localization/locale.categories/category.ctype/locale.ctype.byname/is_1.pass.cpp) is failing on Windows at the `\x00DA` line. I wonder if I could replace this with `\u00DA` and have it work for all `wchar_t`s that are large enough (i.e. 16- or 32-bit). – rubenvb Oct 03 '11 at 15:15
  • /u is for utf16, /U is for utf32, what is for utf8? (and I don't mean the string prefix, that's u8, I mean the hex prefix inside the string) – MarcusJ Apr 07 '18 at 11:42
  • 1
    You mean \, not /? Those are two different things. Also please note that I never said `\u` is for UTF-16. The escaped value is always an abstract codepoint (= number); it's just that `\U` takes a 32-bit number and `\u` takes a 16-bit number. I'm not really sure what a correct version of your question might be, perhaps something like an 8-bit-constrained codepoint reference, i.e. codepoints in the range [0, 256)? I guess that could exist, but it'd have extremely limited value, since most of those codepoints are readily available via ASCII and don't need escaping. – Kerrek SB Apr 07 '18 at 14:22
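A small sketch of the distinction described in that last comment (assuming a C++11 or later compiler; the emoji code point is just an arbitrary non-BMP example):

// Both escapes name an abstract code point; \u takes four hex digits,
// \U takes eight. The resulting encoding comes from the literal's prefix.
const char32_t a = U'\u00DA';         // U+00DA via the short escape
const char32_t b = U'\U000000DA';     // same code point via the long escape
static_assert(a == b, "same abstract code point, different spellings");

// A code point outside the BMP needs the long escape:
const char* smiley = u8"\U0001F600";  // UTF-8-encoded U+1F600 (char8_t in C++20)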