In the ¹comp.lang.c++ Usenet group I recently asserted, based on what I thought I knew, that Windows' 16-bit wchar_t, with UTF-16 encoding where sometimes two such values (called a “surrogate pair”) are needed for a single Unicode code point, is invalid for representing Unicode.
It's certainly inconvenient, and it conflicts with the assumption in the C and C++ standard libraries (e.g. in character classification) that each code point is represented as a single value. Still, the Unicode consortium's ²Technical Note 12 from 2004 makes a good case for using UTF-16 for internal processing, with an impressive list of software that does so.
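To see the conflict in practice, here is a minimal sketch (my own illustration, not from the original discussion), using the arbitrary non-BMP code point U+1D11E, MUSICAL SYMBOL G CLEF:

    #include <iostream>
    #include <string>

    int main()
    {
        std::wstring const s = L"\U0001D11E";    // A single code point.

        // With 16-bit wchar_t (Windows) this prints 2, because the code
        // point is stored as a surrogate pair; per-unit classification
        // functions like iswalpha would then be handed lone surrogates.
        // With 32-bit wchar_t (e.g. Linux) it prints 1.
        std::wcout << s.length() << L"\n";
    }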
And certainly it seems as if the original intent was to have one wchar_t value per code point, consistent with the assumptions of the C and C++ standard libraries. E.g. in the web page “ISO C Amendment 1 (MSE)” over at ³unix.org, about the amendment that brought wchar_t into the C standard in 1995, the authors maintain that
” The primary advantage to the one byte/one character model is that it is very easy to process data in fixed-width chunks. For this reason, the concept of the wide character was invented. A wide character is an abstract data type large enough to contain the largest character that is supported on a particular platform.
But as it turned out, the C and C++ standards seem not to talk about the largest supported character, but only about the largest extended character set among the supported locales: wchar_t must be large enough to represent every code point in the largest such extended character set – but not Unicode, when there is no Unicode locale.
” [the wchar_t type] is an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales.
This is almost identical to the wording in the C++ standard. And it seems to mean that with a restricted set of supported locales, wchar_t can be smallish indeed, down to a single byte with UTF-8 encoding (a nightmare possibility where e.g. no standard library character classification function would work outside of ASCII's A through Z, but hey). Possibly the following is a requirement for it to be wider than that:
” A wide character is a code value (a binary encoded integer) of an object of type wchar_t that corresponds to a member of the extended character set.
… since it refers to the extended character set, but that term seems not to be further defined anywhere.
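As a practical aside, an implementation's choice can at least be probed at compile time. The check below is just my own sketch, not something the standards require an implementation to pass; WCHAR_MAX comes from the <cwchar> header:

    #include <cwchar>       // WCHAR_MAX

    // Fails to compile where wchar_t cannot hold every Unicode code
    // point, 0x0 through 0x10FFFF, as a single value. E.g. it fires on
    // Windows, where WCHAR_MAX is 0xFFFF.
    static_assert( WCHAR_MAX >= 0x10FFFF,
        "wchar_t is too narrow to represent every Unicode code point" );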
And at least with Microsoft's C and C++ runtime there is no Unicode locale: with that implementation ⁴setlocale is restricted to character encodings that have at most 2 bytes per character:
” The set of available locale names, languages, country/region codes, and code pages includes all those supported by the Windows NLS API except code pages that require more than two bytes per character, such as UTF-7 and UTF-8. If you provide a code page value of UTF-7 or UTF-8, setlocale will fail, returning NULL.
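That failure is easy to reproduce, assuming a runtime that behaves as documented above (the locale name spelling here is just illustrative; 65001 is Windows' UTF-8 code page):

    #include <clocale>      // std::setlocale
    #include <iostream>

    int main()
    {
        // Requesting the UTF-8 code page, which per the documentation is
        // rejected because it needs more than two bytes per character,
        // so setlocale returns a null pointer.
        char const* const result =
            std::setlocale( LC_ALL, "English_United States.65001" );
        std::cout
            << (result? result : "(setlocale failed, returned a null pointer)")
            << "\n";
    }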
So it seems that contrary to what I thought I knew, and contrary to my assertion, Windows' 16-bit wchar_t is formally OK. And that is mainly due to Microsoft's ingenious lack of support for UTF-8 locales, or for any locale with more than 2 bytes per character. But is it really so: is 16-bit wchar_t OK?
Links:
¹ news:comp.lang.c++
² http://unicode.org/notes/tn12/#Software_16
³ http://www.unix.org/version2/whatsnew/login_mse.html
⁴ https://msdn.microsoft.com/en-us/library/x99tb11d.aspx