24

Why is wchar_t needed? How is it superior to short (or __int16 or whatever)?

(If it matters: I live in Windows world. I don't know what Linux does to support Unicode.)

CannibalSmith
  • Related: [Is wchar_t needed for unicode support?](http://stackoverflow.com/questions/2259544/is-wchar-t-needed-for-unicode-support) – sleske Aug 08 '13 at 14:28

10 Answers

19

See Wikipedia.

Basically, it's a portable type for "text" in the current locale (with umlauts). It predates Unicode and doesn't solve many problems, so today, it mostly exists for backward compatibility. Don't use it unless you have to.

Aaron Digulla
  • Amen. Dump the ANSI locale stuff entirely, in fact. Treat all text as utf8 (converting on input if you have to) and use the standard C library functions. That's the only sane way to do I18N in C. – Andy Ross Oct 23 '09 at 15:36
  • Unfortunately that won't always work. Some implementations of the C standard library assume at most 2 bytes per character for multibyte strings and don't support a UTF-8 locale. Search Michael Kaplan's blog for more info. – Nemanja Trifunovic Oct 23 '09 at 16:03
  • Nemanja, Michael Kaplan is a prolific writer. Can you please be a little more specific about what to search for? – Rob Kennedy Oct 23 '09 at 16:05
  • This is rather wrong, but I can't nail it down precisely. 2 simple counter-examples show a lot. On Windows, the universal encoding for wchar_t aka WCHAR is UTF-16, which is (A) not locale-specific and (B) definitely Unicode-based. On Mac OSX, wchar_t simply holds the Unicode code point. So, definitely not for backwards compatibility, it's how the two most common desktop OSes support Unicode. – MSalters Oct 26 '09 at 12:10
  • @MSalters but it's the worst form of Unicode. It takes twice as much space as ASCII and doesn't support all Unicode chars. Its only advantage is that it is easy to support with the C string library. – Martin Beckett Nov 29 '10 at 16:23
  • Unicode != UTF-8. Most implementations of Unicode also use 16-bit characters internally because many operations wouldn't be efficient otherwise. So wchar_t is maybe a senile ancestor of Unicode. – Aaron Digulla Nov 30 '10 at 17:39
  • @Martin Beckett: I presume you refer to Windows' UTF-16 WCHAR. A 32-bit `wchar_t` obviously supports all Unicode characters. So does UTF-16, although it takes 2 WCHARs (4 bytes) for characters outside the BMP set. Google terms: "high low surrogate". UTF-16 is more efficient than UTF-8 for Chinese/Japanese/Korean (most CJK ideographs take 3 bytes in UTF-8). – MSalters Dec 01 '10 at 13:51
17

Why is wchar_t needed? How is it superior to short (or __int16 or whatever)?

In the C++ world, wchar_t is a distinct built-in type (in C it's a typedef in <stddef.h>), so you can overload functions on it. For example, this makes it possible to output wide characters rather than their numerical value. In VC6, where wchar_t was just a typedef for unsigned short, this code

wchar_t wch = L'A';
std::wcout << wch;

would output 65 because

std::basic_ostream<wchar_t>::operator<<(unsigned short)

was invoked. In newer VC versions wchar_t is a distinct type, so

std::basic_ostream<wchar_t>::operator<<(wchar_t)

is called, and that outputs A.
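
For illustration, here is a minimal self-contained sketch (my example, not from the original answer) of the same overload distinction; both overloads write to std::wcout to avoid mixing stream orientations:

#include <iostream>

// Hypothetical helpers; the names are made up for this example.
void print(unsigned short v) { std::wcout << L"number: " << v << L'\n'; }
void print(wchar_t c)        { std::wcout << L"wide char: " << c << L'\n'; }

int main() {
    print(static_cast<unsigned short>(65)); // prints: number: 65
    print(L'A');                            // prints: wide char: A
}

This only works because wchar_t and unsigned short are different types to the compiler, even when they have the same size.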

sbi
  • BTW: This behavior can be disabled in the project settings in newer VCs (you shouldn't, but maybe it's needed for backwards compatibility). – mmmmmmmm Oct 23 '09 at 19:58
10

The reason there's a wchar_t is pretty much the same reason there's a size_t or a time_t - it's an abstraction that indicates what a type is intended to represent and allows implementations to choose an underlying type that can represent those values properly on a particular platform.

Note that wchar_t doesn't need to be a 16 bit type - there are platforms where it's a 32-bit type.
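
As a quick check (a sketch I'm adding, not part of the original answer), you can print the width on your platform:

#include <cstdio>

int main() {
    // Typically 2 on Windows (UTF-16 code units) and 4 on most
    // Unix-like systems (one code point per wchar_t).
    std::printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
}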

Michael Burr
  • Note that in C++, wchar_t is a built-in type (like char), whereas size_t and time_t are typedefs. – rdb Oct 10 '15 at 18:09
  • @rdb Do you know what the rationale for making it a builtin type is? – Petr Skocik Jan 12 '19 at 19:34
  • @PSkocik Making it a built-in type makes it possible to overload functions for it, so that you can make different overloads accepting a `wchar_t` and whatever integer type happens to overlap in size with `wchar_t`. C doesn't have overloading, so didn't need that. – rdb Jan 13 '19 at 00:39
8

It is usually considered good practice to give things such as data types meaningful names.

What is best, char or int8? I think this:

char name[] = "Bob";

is much easier to understand than this:

int8 name[] = "Bob";

It's the same thing with wchar_t and int16.

Thomas Padron-McCarthy
  • wchar_t is not always the same size as int16, however. It is a type that varies in width from platform to platform, unfortunately... – fbrereto Oct 23 '09 at 22:50
  • Which is why C++0x introduces char16_t and char32_t, so you can use UTF16 or UCS4 explicitly while still retaining character semantics. – Porculus Nov 29 '10 at 16:37
6

wchar_t is the primitive for storing and processing the platform's Unicode characters. Its size is not always 16 bits. On Unix systems wchar_t is 32 bits (maybe Unix users are more likely to use the Klingon characters that the extra bits are used for :-).

This can pose problems when porting projects, especially if you interchange wchar_t and short, or if you interchange wchar_t and Xerces' XMLCh.

Therefore, having wchar_t as a different type from short is very important for writing cross-platform code. Cleaning this up was one of the hardest parts of porting our application to Unix and then from VC6 to VC2005.

iain
6

As I read the relevant standards, it seems like Microsoft fcked this one up badly.

My manpage for the POSIX <stddef.h> says that:

  • wchar_t: Integer type whose range of values can represent distinct wide-character codes for all members of the largest character set specified among the locales supported by the compilation environment: the null character has the code value 0 and each member of the portable character set has a code value equal to its value when used as the lone character in an integer character constant.

So, a 16-bit wchar_t is not enough if your platform supports Unicode. Each wchar_t is supposed to be a distinct value for a character. Therefore, wchar_t goes from being a useful way to work at the character level of a text (after decoding from the locale multibyte, of course) to being completely useless on Windows platforms.
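
To make the "decoding from the locale multibyte" step concrete, here is a small sketch I'm adding (assuming the environment provides a UTF-8 locale):

#include <clocale>
#include <cstdio>
#include <cstdlib>

int main() {
    std::setlocale(LC_ALL, "");                   // pick up the environment's locale
    const char* mb = "na\xc3\xafve";              // "naïve": 6 bytes of UTF-8
    wchar_t wide[16];
    std::size_t n = std::mbstowcs(wide, mb, 16);  // decode: one wchar_t per character
    if (n != static_cast<std::size_t>(-1))
        std::printf("%zu characters\n", n);       // 5 characters, not 6 bytes
}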

gnud
  • I don't think that's a problem in Microsoft's implementation, but rather that the C++ spec doesn't really account for Unicode. What is a character set in Unicode? Does `wchar_t` have to be able to represent all Unicode code points, or just all code *units*? In the case of UTF16, a code unit is a 16-bit integer, and all of these can be represented by Microsoft's `wchar_t`. – jalf Oct 23 '09 at 14:41
  • I think wide strings (`L"blah"`) are UTF-16 encoded on Windows. So it is able to represent full Unicode, but is a multi-byte encoding (at least for some of the Unicode characters). ICBWT. – sbi Oct 23 '09 at 14:42
  • If it's a multi-byte encoding, then its 'range of values' can't really hold distinct values for all members of the character set, can it? – gnud Oct 25 '09 at 09:24
  • @gnud: You're right, of course, Windows can only represent UCS-2 in `wchar_t` characters. I was thinking in terms of `wchar_t` strings, not `wchar_t` characters. – sbi Oct 26 '09 at 21:25
  • @jalf - the whole point of `wchar_t` is to decode multibyte encodings into a simple representation with one character in each array position. The largest character set specified on Windows is Unicode. UTF-16 is not a character set, it's an encoding of Unicode. – gnud Oct 26 '09 at 22:53
  • I think the reason they have a 16-bit `wchar_t` is that they used to do UCS-2, only, in earlier versions of their OS. – sbi Oct 27 '09 at 08:31
  • It's not useless on Windows. It's useful for calling all those UTF-16-based WinAPI functions. But it is problematic that Windows doesn't have a "character" type that can *actually represent a character*. Until C++0x, anyway. – dan04 Jun 10 '10 at 05:51
  • And what is a "character"? Even if you have a 32-bit wchar_t, the presence of combining forms means that your string may use multiple codepoints to represent what the user thinks of as a single character. – Porculus Nov 29 '10 at 16:41
5

To add to Aaron's comment - in C++0x we are finally getting real Unicode char types: char16_t and char32_t and also Unicode string literals.
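
A brief sketch of what that looks like (an illustrative example I'm adding, using the C++0x/C++11 syntax):

int main() {
    char16_t c = u'A';                  // one UTF-16 code unit
    char32_t d = U'\U0001F600';         // one UTF-32 code point, outside the BMP
    const char16_t* s = u"UTF-16 literal";
    const char32_t* t = U"UTF-32 literal";
    (void)c; (void)d; (void)s; (void)t; // silence unused-variable warnings
}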

Nemanja Trifunovic
2

It is "superior" in a sense that it allows you to separate contexts: you use wchar_t in character contexts (like strings), and you use short in numerical contexts (numbers). Now the compiler can perform type checking to help you catch situations where you mistakenly mix one with another, like pass an abstract non-string array of shorts to a string processing function.

As a side node (since this was a C question), in C++ wchar_t allows you to overload functions independently from short, i.e. again provide independent overloads that work with strings and numbers (for example).
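
Here is a small sketch of that benefit (my example; the function name is made up):

#include <cwchar>

// A string-processing function: only accepts character data.
std::size_t text_length(const wchar_t* s) { return std::wcslen(s); }

int main() {
    wchar_t greeting[] = L"hi";
    short samples[] = { 1, 2, 3 };
    text_length(greeting);    // OK: character context
    // text_length(samples);  // compile error: short* is not wchar_t*
    (void)samples;
}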

AnT stands with Russia
2

wchar_t is a bit of a hangover from before Unicode standardisation. Unfortunately it's not very helpful, because the encoding is platform-specific (and on Solaris, locale-specific!) and the width is not specified. In addition, there is no guarantee that utf-8/16/32 codecvt facets will be available, or indeed any specification of how you would access them. In general it's a bit of a nightmare for portable usage.

Apparently C++0x will have support for Unicode, but at the current rate of progress that may never happen...

Robert Tuck
1

Except for a small ISO 2022 Japanese minority, wchar_t is always going to be Unicode. If you are really anxious, you can make sure of that at compile time:

/* __STDC_ISO_10646__ is predefined (as a yyyymmL date) when wchar_t
   values are ISO 10646 (Unicode) code points. */
#ifndef __STDC_ISO_10646__
#error "non-unicode wchar_t, unsupported system"
#endif

Sometimes wchar_t is 16-bit UCS-2, sometimes 32-bit UCS-4; so what? Just use sizeof(wchar_t). wchar_t is NOT meant to be sent to disk or over the network; it is only meant to be used in memory.
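
For example (a sketch I'm adding, assuming the environment provides a UTF-8 locale), convert to a locale-defined multibyte encoding before the text leaves the process:

#include <clocale>
#include <cstdio>
#include <cstdlib>

int main() {
    std::setlocale(LC_ALL, "");                           // use the environment's locale
    const wchar_t* wide = L"h\u00e9llo";                  // "héllo" held in memory as wchar_t
    char buf[64];
    std::size_t n = std::wcstombs(buf, wide, sizeof buf); // encode for the outside world
    if (n != static_cast<std::size_t>(-1))
        std::fwrite(buf, 1, n, stdout);                   // only encoded bytes hit I/O
}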

See also Should UTF-16 be considered harmful? on this site.

MarcH
  • `__STDC_ISO_10646__` indicates that values of wchar_t are the same as values of Unicode code points. That condition doesn't hold for the Unicode encodings UTF-16 and UTF-8, while it _does_ hold for ASCII and UCS-2. – bames53 Jun 13 '12 at 15:37
  • No matter what `__STDC_ISO_10646__` says, wchar_t is NEVER supposed to store UTF-16 (nor any other encoded form); this is a clear violation of the POSIX standard quoted above. Storing UCS-2 is OK. On platforms which don't care about standards, all bets are off. – MarcH Mar 29 '13 at 23:13
  • It's not that cut and dry. If you're referring to the quote that says all characters in the largest character set among supported locales are represented by distinct wchar_t values that doesn't rule out UTF-16 any more than it rules out UCS-2; As long as no locale supports non-BMP characters then if those non-BMP characters aren't represented by distinct wchar_t values it's not technically a violation. Of course if locale support for characters was the only kind of support then you wouldn't be able to tell the difference, but it isn't. – bames53 Mar 30 '13 at 22:12
  • Almost the entire wchar.h API is based on locales; so using wchar_t outside locales sounds really crazy. Also, how should a function like wctomb() behave when given half a UTF-16 character input? Sorry but UTF-16 in wchar_t is a really too serious abuse of it. Which must actually be why all the Windows developers in this thread seem to hate it. – MarcH Apr 02 '13 at 10:24
  • Windows' provides its own `wchar_t` APIs which are not limited to characters supported by the current C or C++ locale. `wctomb()` is only defined to handle characters supported by the locale, and characters supported by a locale are required to be represented by distinct `wchar_t` values. As long as no locale supports any character outside the BMP then there is technically no violation of the standard. I won't argue that UTF-16 isn't an abuse and a violation of the spirit of wchar_t, only that it is not technically a violation of the letter of the specification. – bames53 Apr 02 '13 at 16:01
  • And if you're interested you can see how I would answer the OP's question: http://stackoverflow.com/a/11107667/365496 – bames53 Apr 02 '13 at 16:04
  • @bames53 let's just say I like your distinction between "spirit" and "technical violation"! – MarcH Apr 03 '13 at 09:19