Here are some excerpts from my copy of the 2014 draft standard N4140:
22.5 Standard code conversion facets [locale.stdcvt]
3 For each of the three code conversion facets codecvt_utf8, codecvt_utf16, and codecvt_utf8_utf16:
(3.1) — Elem is the wide-character type, such as wchar_t, char16_t, or char32_t.
4 For the facet codecvt_utf8:
(4.1) — The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program.
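To make paragraph 4.1 concrete, here is a minimal sketch of the facet in use with Elem = char32_t, where the "UCS4" branch applies. The helper name utf8_to_ucs4 is my own, not something from the standard, and note that std::wstring_convert and the <codecvt> header were deprecated in C++17, so a newer compiler may warn.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into UCS4 code points.
// Elem is char32_t here, so each output element holds one full code point.
std::u32string utf8_to_ucs4(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(utf8);  // throws std::range_error on invalid input
}
```

With this helper, the four-byte UTF-8 sequence F0 9F 98 80 should decode to the single code point U+1F600. No locale object appears anywhere in the call.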
One interpretation of these two paragraphs is that wchar_t must be encoded as either UCS2 or UCS4. I don't like it much because, if it's true, we have an important property of the language buried deep in a library description. I have tried to find a more direct statement of this property, but to no avail.
Another interpretation is that the wchar_t encoding is not required to be either UCS2 or UCS4, and on implementations where it isn't, codecvt_utf8 won't work for wchar_t. I don't like this interpretation much either, because if it's true, and neither the char nor the wchar_t native encoding is Unicode, there doesn't seem to be a way to portably convert between those native encodings and Unicode.
Which of the two interpretations is true? Is there another one which I overlooked?
Clarification. I'm not asking for general opinions about the suitability of wchar_t for software development, or about properties of wchar_t one can derive from elsewhere. I am interested in these two specific paragraphs of the standard. I'm trying to understand what these specific paragraphs entail or do not entail.
Clarification 2. If 4.1 said "The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 or whatever encoding is imposed on wchar_t by the current global locale", there would be no problem. It doesn't. It says what it says. It appears that if one uses std::codecvt_utf8<wchar_t>, one ends up with a bunch of wchar_t encoded as UCS2 or UCS4, regardless of the current global locale. (There is no way to specify a locale or any other character conversion facet for codecvt_utf8.) So the question can be rephrased like this: is the conversion result directly usable with the current global locale (and/or with any possible locale) for output, wctype queries and so on? If not, what is it usable for? (If the second interpretation above is correct, the answer would seem to be "nothing".)
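For illustration, here is a small sketch of what I mean by "regardless of the current global locale". The function names utf8_to_wide and utf8_to_wide_in_locale are hypothetical helpers of mine. The sketch only shows that the facet never consults the locale; it does not settle whether the resulting wchar_t values are meaningful to that locale, which is the actual question.

```cpp
#include <clocale>
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: decode UTF-8 into wchar_t using only codecvt_utf8.
// No locale is passed in, and the facet offers no way to pass one.
std::wstring utf8_to_wide(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.from_bytes(utf8);
}

// Hypothetical helper: switch the global C locale, then decode the same bytes.
// If the named locale is unavailable, setlocale fails and the old one stays.
std::wstring utf8_to_wide_in_locale(const char* locale_name,
                                    const std::string& utf8) {
    std::setlocale(LC_ALL, locale_name);
    return utf8_to_wide(utf8);
}
```

Decoding the UTF-8 bytes E2 82 AC (the euro sign) this way should yield the single value 0x20AC whether the locale is "C" or the user's default "", since U+20AC fits in both a 16-bit and a 32-bit wchar_t.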