
Let x be any member of the basic source character set. 'x' and L'x' are members of the basic execution character set and the basic execution wide-character set, respectively.

Is it true that the integral values of 'x' and L'x' must be equal? It looks like the standard does not require that, which makes sense. One could conceivably use, say, EBCDIC as the narrow charset and Unicode as the wide charset.

Is it true that std::use_facet<std::ctype<wchar_t>>(std::locale()).widen('x') should be equal to L'x' in some (or any) locale? Here it would make sense to require that, but I cannot find such a requirement in the standard either. Likewise, is std::use_facet<std::ctype<wchar_t>>(std::locale()).narrow(L'x') the same as 'x'?

If the above is not true, then which one of these

std::wcout << L'x';
std::wcout << ct.widen('x');

should output x? ct is an appropriate locale facet.
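To make the question concrete, here is a minimal sketch that checks both directions for all 96 basic source characters in a given locale. The function name count_widen_mismatches is hypothetical, and the sketch only reports what one implementation does; it says nothing about what the standard requires.

```cpp
#include <cassert>
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical helper: counts the basic source characters for which
// widen(c) != L'c' or narrow(L'c') != c in the given locale.
// The two literals below spell the same 96 characters, narrow and wide.
std::size_t count_widen_mismatches(const std::locale& loc) {
    const std::string narrow_chars =
        " \t\v\f\n"
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789"
        "_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'";
    const std::wstring wide_chars =
        L" \t\v\f\n"
        L"abcdefghijklmnopqrstuvwxyz"
        L"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        L"0123456789"
        L"_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'";
    assert(narrow_chars.size() == wide_chars.size());

    const auto& ct = std::use_facet<std::ctype<wchar_t>>(loc);
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < narrow_chars.size(); ++i) {
        // Does widening the narrow character give the wide literal's value?
        if (ct.widen(narrow_chars[i]) != wide_chars[i]) ++mismatches;
        // Does narrowing the wide literal's value give the narrow character?
        if (ct.narrow(wide_chars[i], '?') != narrow_chars[i]) ++mismatches;
    }
    return mismatches;
}
```

Calling count_widen_mismatches(std::locale::classic()) or count_widen_mismatches(std::locale()) returns 0 when the facet and the literals agree on every basic character; I would expect that on ASCII-based implementations, but that expectation is an assumption, not a guarantee.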

n. m. could be an AI
  • Microsoft's compiler has Windows ANSI as narrow character set and Unicode as wide character set. Even when Windows ANSI is Windows ANSI Western the codes are not the same. Particularly troublesome, the Euro sign €. – Cheers and hth. - Alf Aug 12 '15 at 08:25
  • @Cheersandhth.-Alf € is not in the basic source character set, no problem here. – n. m. could be an AI Aug 12 '15 at 08:27
  • Depending on the national language Windows is installed for, € is in the execution character set. That includes for the USA and Norway. You have to disregard some erroneous documentation that states that the execution character set is ASCII, because believing it you'd end up producing programs with incorrect results, and wouldn't be able to make sense of the compiler's warnings. ;-) – Cheers and hth. - Alf Aug 12 '15 at 08:31
  • @Cheersandhth.-Alf The C++ standard fixes all 96 members of the basic source character set in [lex.charset]. € is not a member. – n. m. could be an AI Aug 12 '15 at 08:34
  • ↑ Sorry for mindlessly repeating your use of "basic source character set" (now edited, corrected). I didn't stop to think that it's incorrect. The character set of the basic source character set is ASCII minus a few characters, such as $. It would be impractical to not use $, wouldn't you say? – Cheers and hth. - Alf Aug 12 '15 at 08:35
  • One can't portably use `$` or `è` or `®` or `€` or `£` in C++ *source code*. Fine with me, most of my programs don't need any of them. – n. m. could be an AI Aug 12 '15 at 10:23

1 Answer

There is little that can be guaranteed in practice about wide character sets, because the C and C++ standards require that all wide characters can be represented with a single encoding value, while the standard in Windows programming is UTF-16 encoded wide text. Originally Windows wide text was simply original 16-bit Unicode, now called UCS-2, which is still used in Windows console windows, and which conforms to the C and C++ requirements. UTF-16 is an extension of UCS-2 that uses two encoding values, called a surrogate pair, for characters outside the original Unicode's Basic Multilingual Plane, a.k.a. the BMP.
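As a rough illustration of the surrogate pair mechanism, here is a minimal sketch of UTF-16 encoding for a single code point. The function name to_utf16 is hypothetical, and the sketch assumes its input is a valid Unicode scalar value (not itself a surrogate, and at most U+10FFFF).

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: encodes one Unicode code point as UTF-16.
// BMP code points take one 16-bit unit; everything above U+FFFF takes
// two units (a surrogate pair).
std::vector<char16_t> to_utf16(char32_t cp) {
    // Precondition assumed by this sketch: a valid scalar value.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (cp < 0x10000) {
        return { static_cast<char16_t>(cp) };
    }
    const char32_t offset = cp - 0x10000;  // 20 significant bits remain
    return {
        static_cast<char16_t>(0xD800 + (offset >> 10)),   // high surrogate
        static_cast<char16_t>(0xDC00 + (offset & 0x3FF))  // low surrogate
    };
}
```

For example, to_utf16(0x20AC) (the Euro sign) yields the single unit 0x20AC, while to_utf16(0x1F600) yields the pair 0xD83D, 0xDE00. The second case is what a single 16-bit wchar_t cannot hold.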


Re

Is it true that integral values of 'x' and L'x' must be equal? [When x is a member of the C++ basic source character set]

The basic source character set is a subset of ASCII, and nearly all extant general character encodings, including in particular the Unicode encodings, are extensions of ASCII. There is one exception, namely IBM's EBCDIC character encodings (there are multiple variants). However, if it's still used at all, then that's on IBM mainframes.

Thus in practice you have that guarantee, but formally you don't. More importantly, though, it's irrelevant. For example, the basic source character set lacks the $ sign, which you can hardly expect to do without, i.e. limiting oneself to the basic source character set is not a practical proposition.
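If you want to see what your own compiler does, a minimal sketch along these lines compares a few narrow and wide literals directly. The macro SAME_VALUE and the function sample_literals_agree are hypothetical names, and the handful of characters is only a sample of the basic source character set.

```cpp
// Hypothetical helper macro: pastes the L prefix onto a character literal,
// so SAME_VALUE('x') expands to (('x') == L'x').
#define SAME_VALUE(lit) ((lit) == L##lit)

// True if the narrow and wide literals have equal integral values for a
// small sample of basic source characters. Evaluated by the compiler.
constexpr bool sample_literals_agree() {
    return SAME_VALUE('a') && SAME_VALUE('Z') && SAME_VALUE('0') &&
           SAME_VALUE(' ') && SAME_VALUE('{') && SAME_VALUE('\\') &&
           SAME_VALUE('\n');
}
```

A line such as static_assert(sample_literals_agree(), "narrow and wide literals differ"); then compiles on implementations whose narrow and wide encodings both extend ASCII, and would be expected to fail on one that pairs, say, EBCDIC with Unicode.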


Re

Is it true that std::use_facet<std::ctype<wchar_t>>(std::locale()).widen('x') should be equal to L'x' in some (or any) locale [When x is a member of the C++ basic source character set]

For the same reason as for the literals, yes in practice, no in the formal (since encodings like EBCDIC are supported), and also this is irrelevant for the practitioner.

In particular, a more relevant consideration in practice is that Microsoft's Visual C++ has (undocumented) Windows ANSI as its execution character set, and UTF-16 as the wide character encoding. E.g. on my machine the execution character set is Windows-1252, a.k.a. Windows ANSI Western. Some characters have totally different codes in the two encodings: in particular, € is 0x80 in Windows-1252 but U+20AC in Unicode. Worse, there might just be some narrow character set usable as execution character set in which some character's UTF-16 encoding is a surrogate pair of encoding values. In that case widen can't even represent the result; a single wchar_t has no room for it.
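To illustrate the € mismatch, here is a minimal sketch contrasting naive zero-extension with a proper code page lookup. Both function names are hypothetical, and the table is deliberately partial: it lists only three of the Windows-1252 bytes in the 0x80 to 0x9F range and returns U+FFFD for the rest of that range.

```cpp
// Naive "widening": just zero-extend the byte. Correct for ASCII and for
// the Latin-1 range, wrong for Windows-1252 bytes 0x80..0x9F.
constexpr wchar_t naive_widen(unsigned char byte) {
    return static_cast<wchar_t>(byte);
}

// Hypothetical partial Windows-1252 to Unicode mapping (sketch only).
constexpr char32_t cp1252_to_unicode(unsigned char byte) {
    switch (byte) {
        case 0x80: return 0x20AC;  // EURO SIGN
        case 0x85: return 0x2026;  // HORIZONTAL ELLIPSIS
        case 0x99: return 0x2122;  // TRADE MARK SIGN
        default: break;
    }
    if (byte >= 0x80 && byte <= 0x9F) {
        return 0xFFFD;  // not covered by this partial table
    }
    return byte;  // 0x00..0x7F and 0xA0..0xFF keep their values
}
```

Here cp1252_to_unicode(0x80) gives 0x20AC while naive_widen(0x80) gives 0x80, which is why a widen that merely zero-extends is only right for the part of the code page that coincides with Unicode.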

Cheers and hth. - Alf
  • Visual C++ is non-conforming because some characters cannot be represented as a single `wchar_t`. If we exclude those characters and postulate that we work with UCS-2 only, then everything appears OK, because Windows ANSI and UCS-2 presumably have the first 127 characters identical in whatever code page. – n. m. could be an AI Aug 12 '15 at 10:31
  • @n.m.: You're right that Visual C++ ***and every other Windows C and C++ compiler*** is formally non-conforming. AFAIK that's due to silly 1990s politics in the C and C++ committees, standardizing wording that was incompatible with very solidly established practice. That means that the formal doesn't really help you in this area, because the formal here is of such low quality (it's pure politics) that it's utterly unusable. – Cheers and hth. - Alf Aug 12 '15 at 12:43
  • " the C and C++ standards require that all wide characters can be represented with a single encoding value" citation? – Yakk - Adam Nevraumont Aug 12 '15 at 13:23
  • @Yakk 3.9.1 [basic.fundamental]/5 "Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales". – n. m. could be an AI Aug 12 '15 at 13:27
  • @n.m. So long as the character set isn't the character set, things work. Gah. – Yakk - Adam Nevraumont Aug 12 '15 at 13:30
  • @Yakk You can provide an implementation that consists of a compiler by a third party (e.g. gcc, to steer clear of copyright issues) and your own alternative documentation. Nothing wrong with that. – n. m. could be an AI Aug 12 '15 at 13:42