
Here are some excerpts from my copy of the 2014 draft standard N4140:

22.5 Standard code conversion facets [locale.stdcvt]

3 For each of the three code conversion facets codecvt_utf8, codecvt_utf16, and codecvt_utf8_utf16:
(3.1) — Elem is the wide-character type, such as wchar_t, char16_t, or char32_t.

4 For the facet codecvt_utf8:
(4.1) — The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program.

One interpretation of these two paragraphs is that wchar_t must be encoded as either UCS2 or UCS4. I don't like it much because if it's true, we have an important property of the language buried deep in a library description. I have tried to find a more direct statement of this property, but to no avail.

Another interpretation is that the wchar_t encoding is not required to be either UCS2 or UCS4, and on implementations where it isn't, codecvt_utf8 won't work for wchar_t. I don't like this interpretation much either, because if it's true, and neither the char nor the wchar_t native encoding is Unicode, there doesn't seem to be a way to portably convert between those native encodings and Unicode.

Which of the two interpretations is true? Is there another one which I overlooked?

Clarification. I'm not asking for general opinions about the suitability of wchar_t for software development, or for properties of wchar_t one can derive from elsewhere. I am interested in these two specific paragraphs of the standard. I'm trying to understand what these specific paragraphs entail or do not entail.

Clarification 2. If 4.1 said "The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 or whatever encoding is imposed on wchar_t by the current global locale" there would be no problem. It doesn't. It says what it says. It appears that if one uses std::codecvt_utf8<wchar_t>, one ends up with a bunch of wchar_t encoded as UCS2 or UCS4, regardless of the current global locale. (There is no way to specify a locale or any character conversion facet for codecvt_utf8.) So the question can be rephrased like this: is the conversion result directly usable with the current global locale (and/or with any possible locale) for output, wctype queries and so on? If not, what is it usable for? (If the second interpretation above is correct, the answer would seem to be "nothing".)
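To make the setup concrete, here is a minimal sketch of the usage in question (the input string is just an example; whether the final query is meaningful is exactly what is being asked):

```cpp
#include <codecvt>
#include <cwctype>
#include <locale>
#include <string>

int main()
{
    // Convert UTF-8 to wchar_t via codecvt_utf8; per 4.1 the result is
    // UCS2- or UCS4-encoded, regardless of the current global locale.
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    std::wstring ws = conv.from_bytes("\xC3\xA9t\xC3\xA9"); // "été" in UTF-8

    // The question: does the current global locale give this any meaning?
    bool b = std::iswalpha(ws[0]);
    (void)b;
}
```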

n. m. could be an AI
  • wchar_t is non-portable. E.g. on Unix it is UTF-32 and on Windows it is UTF-16 (not UCS2) – Richard Critten Aug 04 '16 at 14:55
  • `wchar_t` is an integral type. What makes you think it has a fixed encoding? It can store the number `7`, and you can interpret that as meaning "the user clicked on the left button". Somewhere else you can interpret `7` in a `wchar_t` to mean "activate fire alarm", and elsewhere as a lower case `a`. The interesting problem is what happens when you read from input and the like, but that isn't the encoding of `wchar_t` but the encoding the io does... The facets describe *what is the encoding result of using that facet on the streaming operation*... – Yakk - Adam Nevraumont Aug 04 '16 at 14:55
  • `wchar_t` just has to be bigger than `char`, that's it. – David Haim Aug 04 '16 at 14:56
  • `wchar_t` is simply *capable* of containing either `UCS2` or `UCS4`; it is not *mandated* to. – Galik Aug 04 '16 at 14:59
  • An *encoding* is an assignment from numbers to meaning. A type doesn't come with such semantics. – Kerrek SB Aug 04 '16 at 15:00
  • It doesn't, big problem. They had to fix it again with char16_t and char32_t in C++11. Albeit that it doesn't specify an encoding either, but whoever is going to not use them for UTF-16 and UTF-32 is going to get a lot of dirty looks. – Hans Passant Aug 04 '16 at 15:03
  • wchar_t is a fiasco. there really ought to be a library making unicode string handling more intuitive in a cross platform way... related: http://stackoverflow.com/questions/2722951/how-to-deal-with-unicode-strings-in-c-c-in-a-cross-platform-friendly-way – Richard Hodges Aug 04 '16 at 15:05
  • @Yakk The two paragraphs from the standard I quoted make me think so. Can you interpret them differently? If so, how? This is the gist of my question. – n. m. could be an AI Aug 04 '16 at 15:09
  • @Yakk: What makes me think `wchar_t` has a fixed encoding? Simple: [`std::iswalpha`](http://en.cppreference.com/w/cpp/string/wide/iswalpha). No facets. – MSalters Aug 04 '16 at 15:10
  • @RichardCritten I'm not asking how I or anyone else should use wchar_t, I'm asking what the standard says. – n. m. could be an AI Aug 04 '16 at 15:11
  • @MSalters "specific to the current locale". `std::iswalpha` uses the local's encoding, which is merely global state. – Yakk - Adam Nevraumont Aug 04 '16 at 15:14
  • The paragraphs merely state what the code converters must do, they do not dictate what the `wchar_t` must generally contain. – Galik Aug 04 '16 at 15:32
  • @Yakk It doesn't matter where the encoding lives. It could be in a locale imbued in a stream or in the current global locale. Fact is, there is an encoding (possibly more than one), and I'm asking whether it/they must all be UCS-something. – n. m. could be an AI Aug 04 '16 at 15:39
  • @Galik So are you implying that the second interpretation is correct? – n. m. could be an AI Aug 04 '16 at 15:39
  • I don't believe either interpretation is correct. I think it is simply saying that `codecvt_utf8` uses `wchar_t` and must choose what encoding to produce based on the size of `wchar_t`. I don't see it saying anything about `wchar_t` itself. – Galik Aug 04 '16 at 15:49
  • The only implication for `wchar_t` (and the other char types) is that it must be *capable* of containing either `UCS2` or `UCS4` depending on its size. – Galik Aug 04 '16 at 15:54
  • "*whatever encoding is imposed on wchar_t by the current global locale*" Facets do not impose encodings on data types. They impose encodings on *operations*. So that statement would be *nonsense*. – Nicol Bolas Aug 04 '16 at 16:00
  • @NicolBolas A data type without its set of operations is useless, so any encoding imposed on operations is imposed on the data type itself. – n. m. could be an AI Aug 04 '16 at 16:07
  • @n.m.: You can always generate data of that data type with operations that *don't* have that encoding imposed. The set of operations that generate `wchar_t` data is not limited to things that use the global facet. – Nicol Bolas Aug 04 '16 at 16:09
  • @NicolBolas yes, I can say `wchar_t x = 17`. This has no standard meaning in the realm of operations that make sense for characters (as opposed to integers). So what meaning does `codecvt_utf8` have? – n. m. could be an AI Aug 04 '16 at 16:13
  • It is a bit like saying that `std::iota` will fill a buffer with specific values. That does not mean the buffer is *confined* to those values and nothing else is allowable. – Galik Aug 04 '16 at 16:55
  • @KerrekSB "A type doesn't come with such semantics" It does. An arbitrary quote from the standard: "A universal-character-name is translated to the encoding, in the appropriate execution character set, of the character named". So it would appear there is a mapping, imposed by the implementation, between universal character names (==unicode code points) and integer codes of members of the appropriate execution character sets (==wchar_t values in this case). I would call this mapping "the encoding of wchar_t" because that's what it is. – n. m. could be an AI Aug 04 '16 at 23:35
  • @n.m.: That's saying that the type `wchar_t` *can be used* to hold values of the execution character set (which have meaning). It does not inherently tie that type to a particular encoding. The type may quite possibly be able to hold values that are not part of the execution character set. That's similar to how `char32_t` can hold all the characters in a UTF-32 string, but it can also hold values that are not part of the Unicode encoding. Or, if you will, how `size_t` can hold the size of any object, but not every value of `size_t` can actually be realized as the size of some object. – Kerrek SB Aug 04 '16 at 23:51
  • @KerrekSB The passage says that `wchar_t` **is** used to hold values of the execution character set. The program translation process makes it so, and it makes use of one special mapping in doing so. One can map `wchar_t` to other sets, including other character sets, but that's irrelevant. The special mapping exists and it comes with the implementation. **I want this special mapping to be accessible from C++ programs**. – n. m. could be an AI Aug 05 '16 at 00:15
  • (By the way, half of all that codecvt stuff is ill-defined and unimplementable.) – Kerrek SB Aug 05 '16 at 00:23
  • @KerrekSB Care to elaborate? – n. m. could be an AI Aug 05 '16 at 00:27
  • @n.m.: no, the details escape me, but I've heard several library vendors say that repeatedly. – Kerrek SB Aug 05 '16 at 00:32
  • RE: Clarification 2. `codecvt_utf8` specifically only converts unicode to unicode. If you want to convert current locale to/from unicode then I believe you can use [std::mbrtoc32](http://en.cppreference.com/w/cpp/string/multibyte/mbrtoc32) etc... – Galik Aug 05 '16 at 00:49
  • @Galik yeah I write about std::mbrtoc32 in my own answer but I still have my doubts... – n. m. could be an AI Aug 05 '16 at 00:55
  • Another interpretation is that `codecvt_utf8` and `codecvt_utf16` see `wchar_t` as either UCS2 or UCS4, regardless of what the rest of the universe sees it as. – Justin Time - Reinstate Monica Dec 27 '16 at 23:05

7 Answers


wchar_t is just an integral type. It has a min value, a max value, etc.

Its size is not fixed by the standard.

If it is large enough, you can store UCS-2 or UCS-4 data in a buffer of wchar_t. This is true regardless of the system you are on, as UCS-2 and UCS-4 and UTF-16 and UTF-32 are just descriptions of integer values arranged in a sequence.

In C++11, there are std APIs that read or write data presuming it has those encodings. In C++03, there are APIs that read or write data using the current locale.

22.5 Standard code conversion facets [locale.stdcvt]

3 For each of the three code conversion facets codecvt_utf8, codecvt_utf16, and codecvt_utf8_utf16:

(3.1) — Elem is the wide-character type, such as wchar_t, char16_t, or char32_t.

4 For the facet codecvt_utf8:

(4.1) — The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program.

So here codecvt_utf8 deals with UTF-8 on one side, and UCS2 or UCS4 (depending on how big Elem is) on the other. It does the conversion.

The Elem (the wide character) is presumed to be encoded in UCS2 or UCS4 depending on how big it is.

This does not mean that wchar_t is encoded as such, it just means this operation interprets the wchar_t as being encoded as such.

How the UCS2 or UCS4 got into the Elem is not something this part of the standard cares about. Maybe you set it in there with hex constants. Maybe you read it from io. Maybe you calculated it on the fly. Maybe you used a high-quality random-number generator. Maybe you added together the bit-values of an ascii string. Maybe you calculated a fixed-point approximation of the log* of the number of seconds it takes the moon to change the Earth's day by 1 second. Not these paragraphs' problem. These paragraphs simply mandate how bits are modified and interpreted.
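A small sketch of that point (assuming, for illustration only, a 32-bit wchar_t):

```cpp
#include <codecvt>
#include <locale>
#include <string>

int main()
{
    // However these values got here (hex constants, io, dice rolls) is
    // irrelevant; the facet simply interprets them as UCS4.
    std::wstring ws;
    ws += static_cast<wchar_t>(0x0416);  // CYRILLIC CAPITAL LETTER ZHE
    ws += static_cast<wchar_t>(0x1F600); // GRINNING FACE (needs 32-bit wchar_t)

    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    std::string utf8 = conv.to_bytes(ws); // produces the UTF-8 byte sequence
}
```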

Similar claims hold in the other cases. This does not mandate what format wchar_t has. It simply states how these facets interpret wchar_t or char16_t or char32_t or char8_t (when reading or writing).

Other ways of interacting with wchar_t use different methods to mandate how the value of the wchar_t is interpreted.

iswalpha uses the (global) locale to interpret the wchar_t, for example. In some locales, the wchar_t may be UCS2. In others, it might be some insane Cthulhian encoding whose details enable you to see a new color from out of space.
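A sketch of that dependence (the locale names are platform assumptions and may not exist on a given system):

```cpp
#include <clocale>
#include <cwctype>

int main()
{
    wchar_t c = 0x00E9; // 'é' only if the locale's wide encoding is UCS-like

    std::setlocale(LC_ALL, "C");
    bool a1 = std::iswalpha(c); // commonly false in the "C" locale

    std::setlocale(LC_ALL, "en_US.UTF-8"); // assumed to exist on this system
    bool a2 = std::iswalpha(c); // may well be true here

    (void)a1; (void)a2;
}
```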

To be explicit: encodings are not the property of data, or bits. Encodings are properties of interpretation of data. Quite often there is only one proper or reasonable interpretation of data that makes any sense, but the data itself is bits.

The C++ standard does not mandate what is stored in a wchar_t. It does mandate what certain operations interpret the contents of a wchar_t to be. That section describes how some facets interpret the data in a wchar_t.

Yakk - Adam Nevraumont

No.

wchar_t is only required to be able to represent every character of the largest extended character set among the locales the implementation supports. That could theoretically fit in a char.

Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1).

— C++ [basic.fundamental] 3.9.1/5

As such, it is not even required to support Unicode:

The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers.

— ISO/IEC 10646:2003 Unicode standard 4.0

Francesco Dondi

Let us differentiate between wchar_t and string literals built using the L prefix.

wchar_t is just an integer type, which may be larger than char.

String literals using the L prefix will generate strings using wchar_t characters. Exactly what that means is implementation-dependent. There is no requirement that such literals use any particular encoding. They might use UTF-16, UTF-32, or something else that has nothing to do with Unicode at all.

So if you want a string literal which is guaranteed to be encoded in a Unicode format, across all platforms, use u8, u, or U prefixes for the string literal.
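A quick illustration (the string is an arbitrary example; note that in C++20 u8 literals become char8_t-based):

```cpp
const char*     s8  = u8"caf\u00e9"; // UTF-8, guaranteed
const char16_t* s16 = u"caf\u00e9";  // UTF-16, guaranteed
const char32_t* s32 = U"caf\u00e9";  // UTF-32, guaranteed
const wchar_t*  sw  = L"caf\u00e9";  // some implementation-defined wide encoding
```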

One interpretation of these two paragraphs is that wchar_t must be encoded as either UCS2 or UCS4.

No, that is not a valid interpretation. wchar_t has no encoding; it's just a type. It is data which is encoded. A string literal prefixed by L may or may not be encoded in UCS2 or UCS4.

If you provide codecvt_utf8 a string of wchar_ts which are encoded in UCS2 or UCS4 (as appropriate to sizeof(wchar_t)), then it will work. But not because of wchar_t; it only works because the data you provide it is correctly encoded.

If 4.1 said "The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 or whatever encoding is imposed on wchar_t by the current global locale" there would be no problem.

The whole point of those codecvt_* facets is to perform locale-independent conversions. If you want locale-dependent conversions, you shouldn't use them. You should instead use the global codecvt facet.
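For instance, a locale-dependent narrow-to-wide conversion can go through the locale's own codecvt facet. A minimal sketch (the helper name is mine; error and partial-conversion handling are omitted):

```cpp
#include <locale>
#include <string>
#include <vector>

std::wstring narrow_to_wide(const std::string& s,
                            const std::locale& loc = std::locale())
{
    if (s.empty()) return std::wstring();

    // The standard codecvt<wchar_t, char, mbstate_t> facet of the locale
    // performs the conversion in that locale's encoding.
    using cvt = std::codecvt<wchar_t, char, std::mbstate_t>;
    const cvt& f = std::use_facet<cvt>(loc);

    std::mbstate_t st{};
    std::vector<wchar_t> buf(s.size()); // one byte yields at most one wide char
    const char* from_next = nullptr;
    wchar_t* to_next = nullptr;
    f.in(st, s.data(), s.data() + s.size(), from_next,
         buf.data(), buf.data() + buf.size(), to_next);
    return std::wstring(buf.data(), to_next);
}
```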

Nicol Bolas
  • @n.m.: My interpretation of those paragraphs is that they mean exactly what they say. Your interpretation of them is confused because your understanding of the words they use is confused. That's why I explained what those words mean. `wchar_t` is not an encoding. It *has no encoding*; it's just a type. – Nicol Bolas Aug 04 '16 at 15:36
  • There is one or more encodings imposed on `wchar_t` by various locale facets. I'm asking whether any or all of them must be UCS-whatever. – n. m. could be an AI Aug 04 '16 at 15:49
  • @n.m.: No, locale facets impose *nothing* on `wchar_t`. They impose encodings on certain operations. So you could build a string for an encoding with an iostream by using a locale that imposes that encoding on the stream. But that has nothing to do with the behavior of `wchar_t` *itself*; that only affects the data stored in the `wchar_t` array. And locales impose nothing on `codecvt` facets. – Nicol Bolas Aug 04 '16 at 15:50
  • "They impose encodings on certain operations" That's imposing an encoding on `wchar_t` in my book. I'm building strings to perform operations on them, not to frame them and hang them on the wall. codecvt is a locale facet, locales just *have* them. – n. m. could be an AI Aug 04 '16 at 16:02
  • I want a very simple thing, to be able to convert UTF-8 to wchar_t in a way that is consistent with other uses of wchar_t. Namely, printing to (untampered with) wcout, comparing with L"" literals, and/or querying isw... bits, without touching my current global locale or stream locales. I know I can convert UTF-8 to UCS4 and stuff these values to wchar_t, but this seems to be a rather useless exercise, unless I happen to know that operations I mentioned do in fact use UCS4. – n. m. could be an AI Aug 04 '16 at 16:36
  • @n.m.: "*to be able to convert UTF-8 to wchar_t in a way that is consistent with other uses of wchar_t.*" There isn't a way to do that, as shown by [the table at the bottom of this page](http://en.cppreference.com/w/cpp/locale/codecvt). Not without knowing *exactly* what the encoding of your `wchar_t`-based strings are, which varies by platform. "*I know I can convert UTF-8 to UCS4 and stuff these values to wchar_t*" Actually, you don't know that you can do that, since `wchar_t` is not required to be able to store UTF-32 codepoints. – Nicol Bolas Aug 04 '16 at 16:44
  • @n.m. Agreeing with NicolBolas here. The standard specifies, in § 3.9.1.5, that "Type `wchar_t` is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales"; while this guarantees that any platform which supports UTF-32 will be able to store any UTF-32 code point in a `wchar_t`, it doesn't guarantee that any system that only supports, say, UTF-8 or UTF-16 will be able to. – Justin Time - Reinstate Monica Dec 27 '16 at 21:52
  • It should also be noted that Windows explicitly _breaks_ this rule, likely for backwards compatibility: `wchar_t` is 16 bits on Windows. – Justin Time - Reinstate Monica Dec 27 '16 at 21:54
  • @JustinTime: But the VC standard library's locales don't support UTF-32. That's what gives them the freedom to limit `wchar_t` to 16-bits. – Nicol Bolas Dec 27 '16 at 22:56
  • @NicolBolas Hmm... seems I'm not 100% sure how to interpret that specification, actually. Due to the wording, there are two possible ways to interpret it regarding Unicode: 1) Each UTF-_n_ variant of Unicode can be treated differently, so a platform that supports UTF-16 requires `wchar_t` to be at least 16 bits, but not necessarily 32 bits, or 2) Since Unicode characters can be up to 32 bits regardless of whether you use UTF-8, UTF-16, UTF-32, UCS-2, or UCS-4, `wchar_t` must be at least 32 bits on any platform that supports Unicode. – Justin Time - Reinstate Monica Jan 02 '17 at 22:54
  • I know [cppreference](http://en.cppreference.com/w/cpp/language/types) leans towards #2, since they specifically mention Windows as a notable exception to the rule. When I made my comment, I appear to have mixed both interpretations, probably because I consulted cppreference while looking `wchar_t`'s specs up. I'm still not sure which is the correct interpretation, myself. – Justin Time - Reinstate Monica Jan 02 '17 at 22:57

Both your interpretations are incorrect. The standard doesn't require that there be a single wchar_t encoding, just like it doesn't require a single char encoding. The codecvt_utf8 facet must convert between UTF-8 and UCS-2 or UCS-4. This is true even if UTF-8, UCS-2, and UCS-4 are not supported as character sets in any locale.

If Elem is of type wchar_t and isn't big enough to store a UCS-2 value, then the conversion operations of the codecvt_utf8 facet are undefined, because the standard doesn't say what happens in that case. If it is big enough (or if you want to argue that the standard requires that it must be big enough), then it's merely implementation-defined whether the UCS-2 or UCS-4 wchar_t values the facet generates or consumes are in an encoding compatible with any locale-defined wchar_t encoding.

Ross Ridge
  • I don't see how they can be both incorrect. It seems to me that your answer implies that the second one is correct (if not, please point out where it fails). – n. m. could be an AI Aug 04 '16 at 16:47
  • @n.m Your second interpretation fails on two points. First, it assumes there is one single global `wchar_t` encoding at a time. There's a single default locale-specific **wide character** encoding, but this only affects certain locale-dependent library functions. Second, the `codecvt_utf8` facet is required to convert between UCS-2/4 and UTF-8 values when `Elem` is `wchar_t`, if `wchar_t` is big enough. If `wchar_t` is, say, 16 bits, then the `codecvt_utf8/16` facets must convert to and from UCS-2, but this doesn't place a requirement on anything else to use UCS-2. – Ross Ridge Aug 04 '16 at 17:46
  • Frankly I don't see where the second interpretation assumes anything like that. If in some implementation the default wchar_t encoding of any locale, or some defined locale, is UCS4, then obviously `codecvt_utf8` is going to be compatible with that locale encoding. The question is whether an implementation is required to make it true or not. The 2nd interpretation says no, it is not. But perhaps it is not worded the best possible way. – n. m. could be an AI Aug 04 '16 at 18:23
  • @n.m. Your second interpretation says that `codecvt_utf8` won't work if "`wchar_t` encoding is not required to be either UCS2 or UCS4". The standard doesn't require "`wchar_t` encoding", whatever you think that means, to be UCS-2/4, but it does require `codecvt_utf8` to work. You could argue that the requirements on `codecvt_utf8` place requirements on the size of `wchar_t`, but they don't place requirements on the encoding used by anything else, anywhere else in the standard. – Ross Ridge Aug 04 '16 at 18:50
  • "it does require that codecvt_ut8 to work" perhaps, for some definition of "work". It doesn't require it to work *sensibly* (i.e. in a way that is compatible with other wchar_t functionality; if I convert `u"abc"`, the result is not required to be equal to L"abc"` which falls under "not working" in my book). I have added my own answer, you are welcome to comment. – n. m. could be an AI Aug 04 '16 at 19:48
  • @n.m. It seems to me the `codecvt_utf8` does work sensibly, since it's designed to handle the case where the programmer can't assume that other functionality supports UTF-8 and UCS-2/4. It's not designed to require full Unicode support on implementations, just provide some basic functionality in cases where programmers want to use Unicode in a portable program. As such it's merely an alternative to programmers writing their own conversion code, and that's enough to make it both useful and sensible. – Ross Ridge Aug 04 '16 at 20:12
  • In my book, `codecvt_utf8` would be working sensibly if it converted between UTF-8 and the native `wchar_t` encoding (yes I'm sure we can talk about the native `wchar_t` encoding). Conversion between UTF-8 and UCS2/UCS4 is handled by `codecvt_utf8<char16_t>` and `codecvt_utf8<char32_t>`. Why is `codecvt_utf8<wchar_t>` ever needed? – n. m. could be an AI Aug 04 '16 at 20:26
  • @n.m. To convert between UCS-2/4 encoded `wchar_t` values regardless of whatever the native wide character encodings are. Programs aren't limited to using `wchar_t` with the implementation defined wide character encodings. For that matter they can also choose to assume that UCS-2/4 is an implementation defined encoding. It wouldn't be sensible to make `codecvt_utf8` dependent on locale since it's designed to perform a specific conversion regardless of locale. The functionality you expect of it should be in some other locale dependent facility. – Ross Ridge Aug 04 '16 at 22:04
  • Why would anyone choose `wchar_t` to store UCS-2/4 encoded data, unless it is somehow known that UCS-2/4 is the native encoding for it? It seems that `char16_t` and `char32_t` would be much better candidates. – n. m. could be an AI Aug 04 '16 at 23:08
  • @n.m. Why should the standard not allow someone to choose `wchar_t`? The programmer could easily know that UCS-2/4 is a native encoding. It's not all that uncommon for that to be true. – Ross Ridge Aug 04 '16 at 23:30
  • If the programmer knows that UCS-2/4 is a native encoding, then my proposed "sensible" semantics of `codecvt_utf8` coincides with the standard semantics and the programmer is more than welcome to use it. It is when this information is not known that `codecvt_utf8` makes no sense. – n. m. could be an AI Aug 04 '16 at 23:43
  • @n.m. No, as I said your proposal isn't sensible because it would make `codecvt_utf8` locale dependent. Anyways, your question was supposed to be about what the standard actually says, not what would make more sense to you. – Ross Ridge Aug 05 '16 at 00:40

It appears your first conclusion is shared by Microsoft, who enumerate the possible options and note that UTF-16, although "widely used as such [sic]", is not a valid encoding.

The same wording is also used by QNX, which points at the source of the wording: Both QNX and Microsoft derive their Standard Library implementation from Dinkumware.

Now, as it happens, Dinkumware is also the author of N2401 which introduced these classes. So I'm going to side with them.

MSalters
  • *It appears your first conclusion is shared by Microsoft* - Could you elaborate? The only thing I can get from that link is the definition of UCS-* / UTF-*, not that `wchar_t` must be encoded as UCS-2/4. – Holt Aug 04 '16 at 15:43
  • Hmm, Microsoft says "Represents a locale facet that converts between wide characters encoded as UCS-2 or UCS-4 ...". It doesn't seem to imply there are no other possibilities. I remember working with machines where wchar_t was one JIS variant or another; are such environments unsupported by current C++? – n. m. could be an AI Aug 04 '16 at 15:43
  • @Holt: That bit follows "... several character encodings. For wide characters ... : " followed by the list defining UCS2, UCS4, and UTF-16. There is no hint to suggest the list is merely examples; it appears to be exhaustive. – MSalters Aug 04 '16 at 15:48
  • @MSalters These are the only ones that appear in the standard, so they merely define the possible interpretations of the terms used in the standard. At least that is how I see it. – Holt Aug 04 '16 at 15:52

Since Elem can be wchar_t, char16_t, or char32_t, clause 4.1 says nothing about a required wchar_t encoding. It states something about the conversion performed.

From the wording, it is clear that the conversion is between UTF-8 and either UCS-2 or UCS-4, depending on the size of Elem. So if wchar_t is 16 bits, the conversion will be with UCS-2, and if it is 32 bits, UCS-4.

Why does the standard mention UCS-2 and UCS-4 and not UTF-16 and UTF-32? Because codecvt_utf8 converts a UTF-8 multibyte sequence to a single wide character:

  • UCS-2 is a subset of Unicode, but unlike UTF-16 it has no surrogate-pair encoding
  • UCS-4 is currently the same as UTF-32 (though with the growing number of emojis, maybe one day 32 bits won't be enough, and there would be a UTF-64 with UTF-32 surrogate pairs that codecvt_utf8 would not support)

It is not clear to me, however, what happens if a UTF-8 text contains a sequence that corresponds to a Unicode character not representable in the UCS-2 used for a receiving char16_t.
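For what it's worth, a sketch of that edge case (as I understand it, wstring_convert reports a failed conversion by throwing std::range_error):

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

int main()
{
    // codecvt_utf8<char16_t> converts to UCS-2; U+1F600 does not fit.
    std::wstring_convert<std::codecvt_utf8<char16_t>, char16_t> conv;
    try {
        std::u16string s = conv.from_bytes("\xF0\x9F\x98\x80"); // U+1F600
    } catch (const std::range_error&) {
        // conversion error: the code point is outside UCS-2
    }
}
```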

Christophe

The first interpretation is conditionally true.

If the __STDC_ISO_10646__ macro (inherited from C) is defined, then the wchar_t encoding is a superset of some version of Unicode.

__STDC_ISO_10646__
An integer literal of the form yyyymmL (for example, 199712L). If this symbol is defined, then every character in the Unicode required set, when stored in an object of type wchar_t, has the same value as the short identifier of that character. The Unicode required set consists of all the characters that are defined by ISO/IEC 10646, along with all amendments and technical corrigenda as of the specified year and month.

It appears that if the macro is defined, some kind of UCS4 can be assumed. (Not UCS2 as ISO 10646 never had a 16-bit version; the first release of ISO 10646 corresponds to Unicode 2.0).

So if the macro is defined, then

  • there is a "native" wchar_t encoding
  • it is a superset of some version of UCS4
  • the conversion provided by codecvt_utf8<wchar_t> is compatible with this native encoding

None of these things are required to hold if the macro is not defined.
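A minimal compile-time sketch of the distinction (the flag name is mine):

```cpp
#if defined(__STDC_ISO_10646__)
    // wchar_t values are ISO 10646 (UCS) short identifiers as of the
    // year/month encoded in the macro, so converting UTF-8 to UCS4 and
    // storing the result in wchar_t is meaningful.
    constexpr bool wchar_t_is_ucs = true;
#else
    // No such guarantee: the native wchar_t encoding is unspecified.
    constexpr bool wchar_t_is_ucs = false;
#endif
```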

There are also __STDC_UTF_16__ and __STDC_UTF_32__ but the C++ standard doesn't say what they mean. The C standard says that they signify UTF-16 and UTF-32 encodings for char16_t and char32_t respectively, but in C++ these encodings are always used.

Incidentally, the functions mbrtoc32 and c32rtomb convert back and forth between char sequences and char32_t sequences. In C they only use UTF-32 if __STDC_UTF_32__ is defined, but in C++ UTF-32 is always used for char32_t. So it would appear that even if __STDC_ISO_10646__ is not defined, it should be possible to convert between UTF-8 and wchar_t by going from UTF-8 to UTF-32-encoded char32_t, then to natively encoded char, then to natively encoded wchar_t; but I'm afraid of this complex stuff.
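A sketch of that chain (the function name is mine; it assumes the current locale can represent every character of the input, and error handling is minimal):

```cpp
#include <climits>  // MB_LEN_MAX
#include <cuchar>   // std::mbrtoc32, std::c32rtomb
#include <cwchar>   // std::mbrtowc
#include <stdexcept>
#include <string>

std::wstring utf8_to_wide(const std::string& utf8)
{
    std::wstring out;
    const char* p = utf8.data();
    const char* end = p + utf8.size();
    std::mbstate_t st{};
    while (p < end) {
        char32_t c32;
        std::size_t n = std::mbrtoc32(&c32, p, end - p, &st); // UTF-8 -> UTF-32
        if (n == 0) n = 1; // a NUL byte was consumed
        if (n > static_cast<std::size_t>(end - p)) // (size_t)-1/-2: bad input
            throw std::runtime_error("invalid or incomplete UTF-8");
        p += n;

        char mb[MB_LEN_MAX];
        std::mbstate_t st2{};
        std::size_t m = std::c32rtomb(mb, c32, &st2); // UTF-32 -> native char
        if (m == static_cast<std::size_t>(-1))
            throw std::runtime_error("not representable in this locale");

        wchar_t wc;
        std::mbstate_t st3{};
        std::size_t k = std::mbrtowc(&wc, mb, m, &st3); // native char -> wchar_t
        if (k == static_cast<std::size_t>(-1) || k == static_cast<std::size_t>(-2))
            throw std::runtime_error("native wide conversion failed");
        out += wc;
    }
    return out;
}
```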

n. m. could be an AI