
From C++2003 2.13:

A wide string literal has type “array of n const wchar_t” and has static storage duration, where n is the size of the string as defined below.

The size of a wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating L’\0’.

From C++0x 2.14.5:

A wide string literal has type “array of n const wchar_t”, where n is the size of the string as defined below.

The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U’\0’ or L’\0’.

The size of a char16_t string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for each character requiring a surrogate pair, plus one for the terminating u’\0’.

The statement in C++2003 is quite vague. But in C++0x, when counting the length of the string, a wide string literal (wchar_t) shall be treated the same as char32_t, and differently from char16_t.
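
For instance, the two counting rules can be checked directly on a C++0x compiler. A minimal sketch (the u/U literal prefixes and static_assert require C++0x support); this translation unit compiles only if the counts match:

    char32_t s32[] = U"\U000E0005"; // 2 elements: the code point + U'\0'
    char16_t s16[] = u"\U000E0005"; // 3 elements: a surrogate pair + u'\0'

    static_assert(sizeof s32 / sizeof s32[0] == 2, "one char32_t plus terminator");
    static_assert(sizeof s16 / sizeof s16[0] == 3, "surrogate pair plus terminator");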

There's a post that clearly states how Windows implements wchar_t: https://stackoverflow.com/questions/402283?tab=votes%23tab-top

In short, wchar_t on Windows is 16 bits wide and strings are encoded using UTF-16. The statement in the standard apparently conflicts with this on Windows.

For example:

wchar_t kk[] = L"\U000E0005";

This code point does not fit in 16 bits; UTF-16 needs two 16-bit code units (a surrogate pair) to encode it.

However, according to the standard, kk is an array of 2 wchar_t (1 for the universal-character-name \U000E0005, 1 for the \0).

But in the internal storage, Windows needs three 16-bit wchar_t objects to store it: two wchar_t for the surrogate pair and one wchar_t for the \0. Therefore, by the definition of an array, kk is an array of 3 wchar_t.

These two results apparently conflict with each other.
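
A small program makes the discrepancy visible; what it prints depends on the compiler:

    #include <iostream>

    int main() {
        wchar_t kk[] = L"\U000E0005";
        // Per the C++0x rule this should print 2 (the code point + L'\0'),
        // but with VC++'s 16-bit UTF-16 wchar_t it prints 3
        // (surrogate pair + L'\0').
        std::cout << sizeof kk / sizeof kk[0] << '\n';
    }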

I think the simplest solution for Windows would be to "ban" anything that requires a surrogate pair in wchar_t (that is, "ban" any Unicode code point outside the BMP).

Is there anything wrong with my understanding?

Thanks.

user534498

2 Answers


The standard requires that wchar_t be large enough to hold any character in the supported character set. Based on this, I think your premise is correct -- it is wrong for VC++ to represent the single character \U000E0005 using two wchar_t units.

Characters outside the BMP are rarely used, and Windows itself internally uses UTF-16 encoding, so it is simply convenient (even if incorrect) for VC++ to behave this way. However, rather than "banning" such characters, it is likely that the size of wchar_t will increase in the future while char16_t takes its place in the Windows API.

The answer you linked to is somewhat misleading as well:

On Linux, a wchar_t is 4-bytes, while on Windows, it's 2-bytes

The size of wchar_t depends solely on the compiler and has nothing to do with the operating system. It just happens that VC++ uses 2 bytes for wchar_t, but once again, this could very well change in the future.
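
A one-line check makes this concrete; the value is a property of the compiler, not of the OS:

    #include <iostream>

    int main() {
        // Typically prints 2 with VC++ and 4 with GCC or Clang on Linux,
        // but nothing ties either value to the operating system itself.
        std::cout << sizeof(wchar_t) << '\n';
    }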

casablanca
  • Thank you, I got it now. Sometimes it's just difficult to understand a new concept, but once you get it, it becomes simpler instantly. – user534498 Dec 08 '10 at 07:18
  • Windows technically uses `WCHAR`, not `wchar_t`. It's been typedef'd as `unsigned short` in the past and could become `char16_t` in the future. But honestly, I don't see that happening - string literals would break. – MSalters Dec 08 '10 at 11:43
  • @MSalters: Why would string literals break? That's what the `TEXT("...")` macros are there for -- people were never supposed to use raw `L"..."` literals. Also, at least on VS2005, `WCHAR` is a typedef for `wchar_t`, not `unsigned short`. – casablanca Dec 08 '10 at 18:28
  • @casablanca: `TEXT("")` is a `TCHAR[]` literal, not a `WCHAR[]` literal. The `typedef unsigned short WCHAR` was used in VC6 and previous versions. – MSalters Dec 09 '10 at 08:36
  • @MSalters: I guess the handful of wide-only functions would break, but other than that, most of the functions are based on `TCHAR` so they should be fine. Of course, I don't see this happening either, at least not in the near future, because honestly nobody cares about characters outside the BMP. – casablanca Dec 09 '10 at 15:09
  • Today VC++ is incorrect. But the reason is that at the time when the decision was made that Windows NT should be Unicode, the Unicode standard itself did not go beyond 65536 code points, and there was no mechanism to go beyond that. – Mihai Nita Apr 16 '11 at 22:22

Windows knows nothing about wchar_t, because wchar_t is a programming concept. Conversely, wchar_t is just storage: it knows nothing about the semantic value of the data you store in it (that is, it knows nothing about Unicode or ASCII or whatever).

If a compiler or SDK that targets Windows defines wchar_t to be 16 bits, then that compiler may be in conflict with the C++0x standard. (I don't know whether there are some get-out clauses that allow wchar_t to be 16 bits.) But in any case the compiler could define wchar_t to be 32 bits (to comply with the standard) and provide runtime functions to convert to/from UTF-16 for when you need to pass your wchar_t* to Windows APIs.
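
A minimal sketch of the kind of runtime conversion described above, assuming a hypothetical compiler with a 32-bit wchar_t targeting Windows (to_utf16 is an invented helper, not part of any SDK):

    #include <cstdint>
    #include <string>

    // Re-encode UTF-32 wchar_t data as UTF-16 before passing it to a
    // Windows API that expects 16-bit code units.
    std::u16string to_utf16(const wchar_t* s) {
        std::u16string out;
        for (; *s; ++s) {
            std::uint32_t cp = static_cast<std::uint32_t>(*s);
            if (cp < 0x10000) {
                out.push_back(static_cast<char16_t>(cp)); // BMP: one unit
            } else {
                cp -= 0x10000; // split into a surrogate pair
                out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
                out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
            }
        }
        return out;
    }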

Ciaran Keating