
In the ¹comp.lang.c++ Usenet group I recently asserted, based on what I thought I knew, that Windows' 16-bit wchar_t, with UTF-16 encoding, where sometimes two such values (a “surrogate pair”) are needed for a single Unicode code point, is invalid for representing Unicode.

It's certainly inconvenient, and in conflict with the assumption of the C and C++ standard libraries (e.g. for character classification) that each code point is represented as a single value, although the Unicode Consortium's ²Technical Note 12 from 2004 makes a good case for using UTF-16 for internal processing, with an impressive list of software that does so.

And certainly it seems as if the original intent was to have one wchar_t value per code point, consistent with the assumptions of the C and C++ standard libraries. E.g. in the web page “ISO C Amendment 1 (MSE)” over at ³unix.org, about the 1995 amendment that brought the wide-character library into the C standard, the authors maintain that

The primary advantage to the one byte/one character model is that it is very easy to process data in fixed-width chunks. For this reason, the concept of the wide character was invented. A wide character is an abstract data type large enough to contain the largest character that is supported on a particular platform.

But as it turned out, the C and C++ standards seem not to talk about the largest supported character, but only about the largest extended character set among the supported locales: wchar_t must be large enough to represent every code point in the largest such extended character set – but not Unicode, when there is no Unicode locale.

C99 §7.17/2 (from the N869 draft):

[the wchar_t type] is an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales.

This is almost identical to the wording in the C++ standard. And it seems to mean that with a restricted set of supported locales, wchar_t can be smallish indeed, down to a single byte with UTF-8 encoding (a nightmare possibility where e.g. no standard library character classification function would work outside of ASCII's A through Z, but hey). Possibly the following is a requirement for it to be wider than that:

C99 §7.1.1/4:

A wide character is a code value (a binary encoded integer) of an object of type wchar_t that corresponds to a member of the extended character set.

… since it refers to the extended character set, but that term seems not to be further defined anywhere.

And at least with Microsoft's C and C++ runtime there is no Unicode locale: with that implementation setlocale is restricted to character encodings that have at most 2 bytes per character:

MSDN ⁴documentation of setlocale:

The set of available locale names, languages, country/region codes, and code pages includes all those supported by the Windows NLS API except code pages that require more than two bytes per character, such as UTF-7 and UTF-8. If you provide a code page value of UTF-7 or UTF-8, setlocale will fail, returning NULL.

So it seems that contrary to what I thought I knew, and contrary to my assertion, Windows' 16-bit wchar_t is formally OK. And mainly due to Microsoft's ingenious lack of support for UTF-8 locales, or any locale with more than 2 bytes per character. But is it really so, is 16-bit wchar_t OK?
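
For concreteness, here is a minimal sketch of the setlocale behavior described in the MSDN quote above, assuming Microsoft's CRT behaves as documented (65001 is the Windows code page number for UTF-8; the message strings are just mine):

    #include <clocale>
    #include <cstdio>

    int main()
    {
        // Per the MSDN quote above, a UTF-8 code page makes setlocale fail and return NULL.
        // 65001 = the UTF-8 code page (an assumption stated in the text above).
        if (std::setlocale(LC_ALL, ".65001") == NULL)
            std::puts("no UTF-8 locale: setlocale returned NULL");

        // The "C" locale (at most one byte per character) is always available.
        if (std::setlocale(LC_ALL, "C") != NULL)
            std::puts("a locale with at most 2 bytes per character is accepted");
    }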


Links:
¹ news:comp.lang.c++
² http://unicode.org/notes/tn12/#Software_16
³ http://www.unix.org/version2/whatsnew/login_mse.html
⁴ https://msdn.microsoft.com/en-us/library/x99tb11d.aspx

Cheers and hth. - Alf
  • The C standard reads a bit different, see 7.19p2. `UTF-8` is not a valid encoding and beyond the C standard. `[unsigned/signed] char` and `wchar_t` are fixed-width values. – too honest for this site Sep 17 '16 at 15:27
  • @Olaf: I take it you're referring to C11. Can you quote that? I don't have it. – Cheers and hth. - Alf Sep 17 '16 at 15:34
  • Feel free to bookmark: http://port70.net/~nsz/c/c11/n1570.html – too honest for this site Sep 17 '16 at 15:35
  • http://jrgraphix.net/research/unicode_blocks.php - this shows that the full unicode set exceeds 16 bits. Should be possible to find a similar link, somewhere on unicode.org... – Sam Varshavchik Sep 17 '16 at 15:36
  • @SamVarshavchik: Yes, iirc, it is 21 bits now, but an implementation can restrict to support less code-pages. – too honest for this site Sep 17 '16 at 15:37
  • Sure, but that means that the answer to whether a 16-bit value is "valid for representing full Unicode" is "no". – Sam Varshavchik Sep 17 '16 at 15:39
  • @SamVarshavchik: Does the question ask about full support? "And mainly due to Microsoft's ingenious lack of support for UTF-8 locales, or any locale with more than 2 bytes per character. **But is it really so**, is 16-bit wchar_t OK?" I'd say the answer is yes, it is ok. Bad design (not the first one), but formally correct. – too honest for this site Sep 17 '16 at 15:41
  • Just checked: Linux/x64 uses 32 bits. That should suffice for a lot of new emoticons and other icons ... :-) – too honest for this site Sep 17 '16 at 15:42
  • I interpret the "full Unicode" reference in the title as referring to the entire Unicode range. And I have a mental block accepting the notion that a 16 bit value is sufficient to support the full Unicode range, which is more than 16 bits. – Sam Varshavchik Sep 17 '16 at 15:43
  • @SamVarshavchik: Hmm, yes, the question is ambiguous. Title does not match the last paragraph. OP should clarify. – too honest for this site Sep 17 '16 at 15:44
  • @SamVarshavchik: The 16-bit encoding of Unicode is called UTF-16. This is the encoding used in Windows, as referred to in the question (and explained in the first paragraph). With a restriction to the BMP it's called UCS-2. – Cheers and hth. - Alf Sep 17 '16 at 15:44
  • @Cheersandhth.-Alf: Not the DVer, but can you please clarify if you mean the full set or the MS-support restricted? – too honest for this site Sep 17 '16 at 15:44
  • @Cheersandhth.-Alf: UTF-16 has the same problems as UTF-8 with regard to the standard. Both are variable-length encodings. – too honest for this site Sep 17 '16 at 15:45
  • @Olaf: Re " the full set or the MS-support restricted", the question title mentions "full Unicode". I am not aware of an MS restriction of Unicode. MS is a founding member of the Unicode Consortium. – Cheers and hth. - Alf Sep 17 '16 at 15:46
  • @Cheersandhth.-Alf: That never kept them from doing their own thing. Anyway, I just cited what you wrote: "And mainly due to Microsoft's ingenious lack of support for UTF-8 locales, **or any locale with more than 2 bytes per character**." - that implies they use a fixed-length encoding. No idea what happens if there ever was some larger CP used. – too honest for this site Sep 17 '16 at 15:48
  • IIRC MS started using `wchar_t` when UCS2 was still a thing. Windows is now UTF-16 but for backward compatibility reasons ... – Richard Critten Sep 17 '16 at 15:49
  • You make it sound like there were a fixed-width Unicode encoding. There isn't. Even with UTF-32 a single code point may be represented by 2 UTF-32 code units. Regardless of the encoding you choose, your code will always have to be prepared to deal with multi-code-unit code points. – IInspectable Sep 17 '16 at 15:53
  • @IInspectable: So Wikipedia is wrong? (serious question) https://en.wikipedia.org/wiki/UTF-32 – too honest for this site Sep 17 '16 at 15:58
  • @Olaf: Can you provide any MS induced restrictions in the UTF-16 implementation of Windows? And no, that quote from the question doesn't imply fixed-length encoding. It's Microsoft's MBCS encoding (which is really DBCS). – IInspectable Sep 17 '16 at 15:58
  • @Olaf: Wikipedia may not be entirely wrong, but it doesn't tell you the entire truth either. A decomposed code point takes 2 UTF-32 code units. This is hinted to in the article as well: *"Editors that limit themselves to left-to-right languages and **precomposed characters** can take advantage of fixed-sized code units"*. – IInspectable Sep 17 '16 at 16:01
  • @IInspectable: I solely operated on OP's question/assumptions about wchar_t. Re. UTF-32: your statement contradicts the WP entry in German and English. No offence, but I'd need more information about how & when a single code-point would take two UTF-32 CUs. It is not that I'm deeply in that, your assertion made me curious, as I always had in mind what I now read on WP. – too honest for this site Sep 17 '16 at 16:05
  • From the standard @Olaf points to: `wchar_t` should be able to use "sequences of multibyte characters". That contradicts "an abstract data type large enough to contain the largest character", right? If it did, you would not need to provide supporting code for a *sequence*. (Unless I'm interpreting the wording wrong and the sequence mentioned is that of "2 bytes" :) – Jongware Sep 17 '16 at 16:16
  • @Olaf: U+0041 (Latin Capital Letter A) followed by U+0308 (Combining Diaeresis) is certainly an easy to follow example, where a single code point would require 2 UTF-32 code units, to represent the code point U+00C4 (Latin Capital Letter A With Diaeresis). – IInspectable Sep 17 '16 at 16:21
  • @IInspectable UTF-32 is a fixed length encoding. The example you point is a case of combining code points. https://en.wikipedia.org/wiki/Combining_character Strictly speaking, code points are fixed size. – dimm Sep 17 '16 at 16:32
  • (@IInspectable: no, that falls under *normalization*, not *encoding*.) – Jongware Sep 17 '16 at 16:33
  • @IInspectable: As I'm not sure how a code-point is exactly defined, I'll leave it to you experts to discuss this matter. But right now it looks like you are in error. Anyway, thanks to all for the input. – too honest for this site Sep 17 '16 at 16:55
  • @RadLexus: That assumes that a normal form always exists. There is none for n̈ (as in [*Spın̈al Tap*](https://en.wikipedia.org/wiki/Spinal_Tap_(band))). – IInspectable Sep 17 '16 at 17:07
  • @IInspectable no it does not assume that a normal form exists. UTF-32 is a fixed length encoding for code points, not characters. – Stuart Sep 17 '16 at 19:49
  • @Stuart: You mean UTF-32 is fixed length encoding. I'm not sure about the Unicode terminology regarding character versus glyph. There's something there. But code point is simple enough. It's what a Unicode code means. :) – Cheers and hth. - Alf Sep 17 '16 at 19:51
  • @Cheersandhth.-Alf thanks I meant UTF-32 not UTF-16. – Stuart Sep 17 '16 at 19:54
  • @Stuart: Unless the quoted language standard in this question equates *character* with *Unicode code point*, the distinction is not relevant. The point really is: No matter which Unicode character encoding you choose, there are 'characters' in any encoding, that cannot be represented by a single code unit. (Then again, maybe the question should be: What's the definition of a character in C and C++?) – IInspectable Sep 17 '16 at 20:09
  • @IInspectable I get that programs are usually concerned with characters and not code points, so yes I agree that in practice, most programs using UTF-32 may still have to be prepared to handle multiple code units per character, but the point is that technically UTF-32 is a fixed length encoding. Also, you were in error when you wrote "Even with UTF-32 a single code point may be represented by 2 UTF-32 code units." No, a single code point may be decomposed into two different code points, each of which is encoded in a UTF-32 code unit. – Stuart Sep 17 '16 at 20:47
  • "Is 16-bit wchar_t formally valid for representing full Unicode?" --> IMO No. ( opinion, thus comment and not answer) MS bet early on with 16-bit Unicode before surrogates were added and stayed with UTF16 when they were added to be _different_ [embrace enhance extinguish](https://en.wikipedia.org/wiki/Embrace,_extend_and_extinguish) In 2016 MS is still not C99 compliant. It makes little difference. In the future, there will only be ASCII, UTF8 and UTF32 and 32-bit `wchat_t` - that's it. Support for all other paths with dwindle and die off. – chux - Reinstate Monica Sep 17 '16 at 21:32
  • C++11 has `char32_t`, and a number of ways to convert to and from it. On Windows, though, you would need to convert to UTF-16 (or use the limited support for UTF-8) to do anything with your strings in the underlying OS, and that will always be true for any program that uses the API. It is certainly possible to convert between UTF-8 for I/O and UCS32 for internal processing. – Davislor Sep 17 '16 at 23:12
  • @IInspectable: "*Unless the quoted language standard in this question equates character with Unicode code point, the distinction is not relevant.*" FYI: The **Unicode** standard equates "character" with "codepoint". Go ahead; look it up. They even call their database of codepoints the "[Unicode *Character* Database](http://unicode.org/ucd/)". The Unicode standard makes a distinction between character/codepoints and *glyphs* and grapheme clusters. But they constantly treat "character" as equivalent to "codepoint". – Nicol Bolas Sep 17 '16 at 23:29
  • @chux: MS stuck with UTF-16, because customers don't understand why switching to a 'better' character encoding would break their old software. There's no inherent advantage to any other character encoding. It's an artifact, private to the implementation. Besides, UTF-16 is here to stay: .NET strings use it, as does Delphi, and more notably Java and JavaScript. – IInspectable Sep 18 '16 at 00:09
  • So goes the web [2016](https://w3techs.com/technologies/overview/character_encoding/all) 87% UTF8 1% UTF16, so goes the neighborhood. – chux - Reinstate Monica Sep 18 '16 at 01:32
  • @chux UTF-8 is more commonly used for **storage and transmission** purposes, but UTF-16 is more commonly used for **processing** purposes. Most languages and APIs, even web-based APIs, DOMs, etc use UTF-16 strings in memory. – Remy Lebeau Sep 18 '16 at 17:10
  • @Remy Lebeau Any reference to support "UTF-8 is more commonly used for ... but UTF-16 is more commonly used for processing purposes."? Any ref to indicate that is the stable or growing trend? – chux - Reinstate Monica Sep 18 '16 at 17:21
  • `wchar_t` is hopelessly broken given its implementation history attempting to solve character sets wider than 16 bits. This post exemplifies its troubles and unclear usage. C's `char32_t` may provide a future path for portable software using Unicode. See also `__STDC_UTF_16__`, `__STDC_UTF_32__`. – chux - Reinstate Monica Sep 18 '16 at 17:28
  • @chux experience. Many Internet protocols use UTF-8 nowadays to transfer data, but most (not all) common programming languages/APIs use UTF-16 in memory. Case in point - your Web example. The Web may transfer HTML from server to browser in UTF-8, but VBScript/Javascript, browser DOMs, HTML APIs, etc access/represent the HTML content in UTF-16. – Remy Lebeau Sep 18 '16 at 17:35
  • @chux `wchar_t` is not suitable for portable code because it is not the same byte size on all platforms. Yes, one could use `char32_t` for UTF-32, but at the cost of using larger amounts of memory. There is also `char16_t` for UTF-16 on all platforms. – Remy Lebeau Sep 18 '16 at 17:38
  • related: https://stackoverflow.com/a/11107667/995714 – phuclv May 19 '18 at 13:50

3 Answers


wchar_t is not now and never was a Unicode character/code point. The C++ standard does not declare that a wide-string literal will contain Unicode characters. The C++ standard does not declare that a wide-character literal will contain a Unicode character. Indeed, the standard doesn't say anything about what wchar_t will contain.

wchar_t can be used with locale-aware APIs, but those are only relative to the implementation-defined encoding, not any particular Unicode encoding. The standard library functions that take these use their knowledge of the implementation's encoding to do their jobs.

So, is a 16-bit wchar_t legal? Yes; the standard does not require that wchar_t be sufficiently large to hold a Unicode codepoint.

Is a string of wchar_t permitted to hold UTF-16 values (or variable width in general)? Well, you are permitted to make strings of wchar_t that store whatever you want (so long as it fits). So for the purposes of the standard, the question is whether standard-provided means for generating wchar_t characters and strings are permitted to use UTF-16.

Well, the standard library can do whatever it wants to; the standard offers no guarantee that a conversion from any particular character encoding to wchar_t will be a 1:1 mapping. Even char->wchar_t conversion via wstring_convert is not required anywhere in the standard to produce a 1:1 character mapping.

If a compiler wishes to declare that the wide character set consists of the Basic Multilingual Plane of Unicode, then a wide-character literal like L'\U0001F000' will produce a single wchar_t. But the value is implementation-defined, per [lex.ccon]/2:

The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character set, unless the c-char has no representation in the execution wide-character set, in which case the value is implementation-defined.

And of course, C++ doesn't allow the use of surrogate values as a c-char; \uD800 is a compile error.

Where things get murky in the standard is the treatment of strings that contain characters outside of the character set. The above text would suggest that implementations can do what they want. And yet, [lex.string]/16 says this:

The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U’\0’ or L’\0’.

I say this is murky because nothing says what the behavior should be if a c-char in a string literal is outside the range of the destination character set.

Windows compilers (both VS and GCC-on-Windows) do indeed cause L"\U0001F000" to have an array size of 3 (one surrogate pair, i.e. two code units, plus a single NUL terminator). Is that legal C++ standard behavior? What does it mean to provide a c-char to a string literal that is outside of the valid range for a character set?
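
A quick way to see this (a sketch; the exact number depends on the compiler and target, but this is what the Windows compilers mentioned above produce, versus a typical platform with 32-bit wchar_t):

    #include <iostream>

    int main()
    {
        // U+1F000 is outside the BMP, so with a 16-bit wchar_t it is stored
        // as a surrogate pair: two code units plus the terminating L'\0'.
        const wchar_t s[] = L"\U0001F000";
        std::cout << sizeof s / sizeof s[0] << '\n';   // 3 with 16-bit wchar_t (Windows),
                                                       // 2 with 32-bit wchar_t (e.g. Linux/x64)
    }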

I would say that this is a hole in the standard, rather than a deficiency in those compilers. The standard should make clearer what the conversion behavior ought to be in this case.


In any case, wchar_t is not an appropriate tool for processing Unicode-encoded text. It is not "formally valid" for representing any form of Unicode. Yes, many compilers implement wide-string literals as a Unicode encoding. But since the standard doesn't require this, you cannot rely on it.

Now obviously, you can stick whatever will fit inside of a wchar_t. So even on platforms where wchar_t is 32 bits, you could shove UTF-16 data into them, with each 16-bit word taking up 32 bits. But you couldn't pass such text to any API function that expects the wide character encoding unless you knew that this was the expected encoding for that platform.

Basically, never use wchar_t if you want to work with a Unicode encoding.

Nicol Bolas
  • This answer misses the spirit of the question. Instead of asking "is wchar_t guaranteed to be Unicode?", they are asking "is Microsoft violating the C++ standard by using wchar_t for variable-length UTF-16 with surrogate pairs?". With Windows programming, wchar_t is inherently UTF-16; this has been true for 20 years and it isn't going to change, so it is perfectly fine to use wchar_t strings as UTF-16. Furthermore, it is very unlikely that someone writing UTF-16 code for the express purpose of using it in a Windows program is trying to write it in a portable manner (apart from using `wchar.h`) – andlabs Sep 18 '16 at 01:02
  • I guess another way to phrase the question in the context of your answer is "is UTF-16 with surrogate pairs *permitted* by the C++ standard as a possible character set for wide strings?". – andlabs Sep 18 '16 at 01:04
  • @andlabs: Permitted in what way? By which functions? Or as the output of which operations? – Nicol Bolas Sep 18 '16 at 01:07
  • At all, by standard functions and with both `L'xxx'` and `L"xxx"` constants in the target encoding of the output binary (I forget what the standard calls this encoding). For example, if the standard specifically had a requirement that the wide string encoding was not variable-length, or required that a conformant implementation that used Unicode as a character set required `L''` to be represented by exactly one `wchar_t` value, then the answer would be "yes, Microsoft is violating the C++ standard". – andlabs Sep 18 '16 at 01:09
  • @Stuart: "*wchar_t is a library/compiler thing*" Nobody is claiming otherwise. The question is whether the behavior of Windows-based compilers with regard to their standard-required handling of `wchar_t` is conformant with the C++ standard. – Nicol Bolas Sep 19 '16 at 19:37
  • @andlabs: Added citations from the standard. – Nicol Bolas Sep 19 '16 at 20:12
  • The standard quote beginning "The size of a char32_t or wide string literal " is very interesting. It may be just what I was looking for. – Cheers and hth. - Alf Sep 20 '16 at 13:45

Now that the question has been clarified, here is an edit.

Q: Is the width of 16 bits for wchar_t in Windows conformant to the standard?

A: Well, let's see. We will start with the definition of wchar_t from the C99 draft.

... largest extended character set specified among the supported locales.

So, we should look at what the supported locales are. For that there are three steps:

  1. We check the documentation for setlocale
  2. We quickly open the documentation for the locale string and see the format of the string:

    locale :: "locale_name"
            | "language[_country_region[.code_page]]"
            | ".code_page"
            | "C"
            | ""
            | NULL
    
  3. We see the list of supported code pages and we see UTF-8, UTF-16, UTF-32 and whatnot. We're at a dead end.

If we start with the C99 definition, it ends with

... corresponds to a member of the extended character set.

The term "character set" is used. If we say UTF-16 code units are our character set, then all is OK; otherwise, it's not. It's rather vague, and one should not care much: the standards were written many years ago, when Unicode was not popular.

At the end of the day, we now have C++11 and C11 that define use cases for UTF-8, 16 and 32 with the additional types char16_t and char32_t.


You need to read about Unicode and you will answer the question yourself.

Unicode is a character set: a set of about 200,000 characters or, more precisely, a mapping between numbers and characters. Unicode by itself does not imply any particular bit width.

Then there are four encodings: UTF-7, UTF-8, UTF-16 and UTF-32. UTF stands for Unicode Transformation Format. Each format defines code units; a code point is an actual character from Unicode and consists of one or more units. Only UTF-32 has one unit per point.

Each unit, on the other hand, fits into a fixed-size integer. So UTF-7 units are at most 7 bits, UTF-16 units are at most 16 bits, etc.

Therefore, in a 16-bit wchar_t string we can hold Unicode text encoded in UTF-16. In particular, in UTF-16 each code point takes one or two units.

So the final answer: in a single wchar_t you cannot store every Unicode character, only the single-unit ones, but in a string of wchar_t you can store any Unicode text.
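
As a rough illustration of the "one or two units" point, here is a minimal sketch of how a single code point becomes UTF-16 code units (the function name is made up, and invalid inputs such as lone surrogates are not checked):

    #include <string>

    // Sketch only: encode one Unicode code point as UTF-16 code units.
    std::u16string encode_utf16(char32_t cp)
    {
        if (cp < 0x10000)                               // BMP code point: one 16-bit unit
            return { static_cast<char16_t>(cp) };

        cp -= 0x10000;                                  // supplementary plane: surrogate pair
        return { static_cast<char16_t>(0xD800 + (cp >> 10)),     // high (lead) surrogate
                 static_cast<char16_t>(0xDC00 + (cp & 0x3FF)) }; // low (trail) surrogate
    }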

dimm

Let's start from first principles:

(§3.7.3) wide character: bit representation that fits in an object of type wchar_t, capable of representing any character in the current locale

(§3.7) character: 〈abstract〉 member of a set of elements used for the organization, control, or representation of data

That, right away, discards full Unicode as a character set (a set of elements/characters) representable in a 16-bit wchar_t.

But wait, Nicol Bolas quoted the following:

The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U’\0’ or L’\0’.

and then wondered about the behavior for characters outside the execution character set. Well, C99 has the following to say about this issue:

(§5.1.1.2) Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.8)

and further clarifies in a footnote that not all source characters need to map to the same execution character.

Armed with this knowledge, you can declare that your wide execution character set is the Basic Multilingual Plane, and that you consider surrogates as proper characters themselves, not as mere surrogates for other characters. AFAICT, this means you are in the clear as far as Clause 6 (Language) of ISO C99 is concerned.

Of course, don't expect Clause 7 (Library) to play along nicely with you. As an example, consider iswalpha(wint_t). You cannot pass astral characters (characters outside the BMP) to that function; you can only pass it the two surrogates. And you'd get some nonsensical result, but that's fine because you declared the surrogates themselves to be proper members of the execution character set.
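
For example, something like the following sketch (the particular character U+1D49C and its surrogate values 0xD835 0xDC9C are mine, just for illustration; what iswalpha returns for the surrogates is whatever the implementation decides):

    #include <cwctype>
    #include <iostream>

    int main()
    {
        // U+1D49C (MATHEMATICAL SCRIPT CAPITAL A) as a UTF-16 surrogate pair.
        // With a 16-bit wchar_t the classification functions only ever see the
        // individual surrogates, never the astral character itself.
        const wchar_t script_a[] = { 0xD835, 0xDC9C, 0 };
        std::cout << std::iswalpha(script_a[0]) << ' '
                  << std::iswalpha(script_a[1]) << '\n';
    }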

ninjalj
  • Keep in mind that there is no Unicode locale with Microsoft's compilers. Or with other compilers such as g++ that use the MS runtime. So the "That, right away, discards full Unicode as a character set ... representable on 16-bit wchar_t", if a valid inference, would discard everything but the very start of Unicode, namely Latin-1, range 0 through 255, as representable with `wchar_t`, regardless of its size. But the inference is not valid. It takes an implication A=>B and uses it the wrong way as if it said B=>A. – Cheers and hth. - Alf Sep 21 '16 at 21:12