19

This intrigues me, so I'm going to ask: why is wchar_t not as widely used on Linux and Linux-like systems as it is on Windows? Specifically, the Windows API uses wchar_t internally, whereas I believe Linux does not, and this is reflected in a number of open source packages using char types.

My understanding is that, given a character c which requires multiple bytes to represent, in char[] form c is split across several elements of the array, whereas it forms a single unit in a wchar_t[]. Is it not easier, then, to always use wchar_t? Have I missed a technical reason that negates this difference, or is it just an adoption problem?
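To make that concrete, here is a minimal sketch of the difference I mean (assuming the char[] holds UTF-8; the character 'é' is just an example):

#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main(void) {
    const char *utf8 = "\xC3\xA9";    /* 'é' (U+00E9) as UTF-8: split over two char units */
    const wchar_t *wide = L"\u00E9";  /* the same character as a single wchar_t unit */
    printf("char[] units:    %zu\n", strlen(utf8)); /* prints 2 */
    printf("wchar_t[] units: %zu\n", wcslen(wide)); /* prints 1 */
    return 0;
}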

– ismail

4 Answers

21

wchar_t is a wide character type with a platform-defined width, which doesn't really help much.

UTF-8 uses 1-4 bytes per character. UCS-2, which uses exactly 2 bytes per character, is now obsolete and cannot represent the full Unicode character set.
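As a minimal sketch of that variable width (the sample code points are my own choice; each string literal spells out the UTF-8 bytes of a single character):

#include <stdio.h>
#include <string.h>

int main(void) {
    /* UTF-8 length in bytes of one character, from ASCII up to a non-BMP code point */
    printf("U+0041 'A': %zu byte(s)\n", strlen("A"));                /* 1 */
    printf("U+00E9:     %zu byte(s)\n", strlen("\xC3\xA9"));         /* 2 */
    printf("U+20AC:     %zu byte(s)\n", strlen("\xE2\x82\xAC"));     /* 3 */
    printf("U+1D11E:    %zu byte(s)\n", strlen("\xF0\x9D\x84\x9E")); /* 4 */
    return 0;
}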

Linux applications that support Unicode tend to do so properly, above the byte-wise storage layer. Windows applications tend to make this silly assumption that only two bytes will do.

wchar_t's Wikipedia article briefly touches on this.

– Lightness Races in Orbit
  • 24
    Windows uses UTF-16 which does not make the assumption that two bytes are enough. UTF-16 can represent the entirety of Unicode. [UTF-16's Wikipedia article](http://en.wikipedia.org/wiki/UTF-16) briefly touches on this :-) – Joey Jan 03 '11 at 21:06
  • 14
    On the other hand, lots of Linux apps make the "silly assumption" that UTF-8 means they don't have to change anything to make their code operate correctly w.r.t. the unicode standard, and can still use plain `char *`s everywhere and not pay attention to things. – Billy ONeal Jan 03 '11 at 21:07
  • 3
    Fair enough. I think the real reason is that it's useful when dealing with character encoding to have some standard, guaranteed fixed-width type to work with underneath your encoding layer. `char` gives you that (it's defined to be a single byte, and you can use as many as you like!); `wchar_t` does not. – Lightness Races in Orbit Jan 03 '11 at 21:08
  • 2
    If you operate on the byte level, then `char` is appropriate. If you operate on the code-point level, then you need a UTF-32 code unit. – Joey Jan 03 '11 at 21:10
  • 13
    @Joey: Yes, and that's exactly why Windows' UTF-16 is no better than UTF-8 in the end: you can't predict character size, hence you can't move by a given number of characters inside a string. So what's the point of using twice the space when writing English messages? – kriss Jan 03 '11 at 21:12
  • 7
    @kriss @Tomalak @Joey: Do keep in mind that when "Unicode" was added to Win32, 2 bytes was enough to encode any code point. (NT3.51 shipped well before 1996, when UTF-16 was introduced) This is why Windows uses UTF-16 now -- they had already decided to use wchar_t, and they couldn't break the entire API. Also, even if your app is using UCS-2 only, you still can encode most any language in modern use without difficulty. – Billy ONeal Jan 03 '11 at 21:20
  • 6
    @kriss: Legacy. Windows has used UCS-2 from the very beginning and moving on to UTF-16 is the most sensible thing to do. Java has a similar legacy in that regard. Back then UCS-2 *could* represent all of Unicode with code units and code points being equivalent – which in itself is a very nice thing to have, regardless of storage requirements for text (and Unicode text is very likely not the biggest part that eats your HDD space). So no real surprise *why* that design choice was made. *(read on)* – Joey Jan 03 '11 at 21:20
  • 1
    *(continued)* However, people have learned from past mistakes and I doubt an operating system designed from the ground up today would use UCS-4. The need is simply not that pressing. Depending on how you look at text or a given problem you need bytes, code units, code points or even graphemes. And in every case the proper choice of type is different. – Joey Jan 03 '11 at 21:21
  • 5
    Well, Linux has the advantage that using CRT string functions will be wrong in ~80% of the world, Windows' is wrong in ~3% of the world, depending on how fancy the Chinese customer gets. That does force Linux programmers to always use the expensive but correct codepoint iterator. If UTF-16 is considered too expensive, I wonder how often that happens. – Hans Passant Jan 03 '11 at 22:14
  • 1
    @Hans Passant: At least this problem is fixed in C++0x's CRT. – Billy ONeal Jan 04 '11 at 04:10
9

The first people to use UTF-8 on a Unix-based platform explained:

> The Unicode Standard [then at version 1.1] defines an adequate character set but an unreasonable representation [UCS-2]. It states that all characters are 16 bits wide [no longer true] and are communicated and stored in 16-bit units. It also reserves a pair of characters (hexadecimal FFFE and FEFF) to detect byte order in transmitted text, requiring state in the byte stream. (The Unicode Consortium was thinking of files, not pipes.) To adopt this encoding, we would have had to convert all text going into and out of Plan 9 between ASCII and Unicode, which cannot be done. Within a single program, in command of all its input and output, it is possible to define characters as 16-bit quantities; *in the context of a networked system with hundreds of applications on diverse machines by different manufacturers* [italics mine], it is impossible.

The italicized part is less relevant to Windows systems, which have a preference towards monolithic applications (Microsoft Office), non-diverse machines (everything's an x86 and thus little-endian), and a single OS vendor.

And the Unix philosophy of having small, single-purpose programs means fewer of them need to do serious character manipulation.

> The source for our tools and applications had already been converted to work with Latin-1, so it was ‘8-bit safe’, but the conversion to the Unicode Standard and UTF[-8] is more involved. Some programs needed no change at all: cat, for instance, interprets its argument strings, delivered in UTF[-8], as file names that it passes uninterpreted to the open system call, and then just copies bytes from its input to its output; it never makes decisions based on the values of the bytes... Most programs, however, needed modest change.

> ...Few tools actually need to operate on runes [Unicode code points] internally; more typically they need only to look for the final slash in a file name and similar trivial tasks. Of the 170 C source programs...only 23 now contain the word Rune.

> The programs that do store runes internally are mostly those whose raison d’être is character manipulation: sam (the text editor), sed, sort, tr, troff, 8½ (the window system and terminal emulator), and so on. To decide whether to compute using runes or UTF-encoded byte strings requires balancing the cost of converting the data when read and written against the cost of converting relevant text on demand. For programs such as editors that run a long time with a relatively constant dataset, runes are the better choice...
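The "final slash" case in that quote works on UTF-8 byte strings without any decoding, because an ASCII byte such as '/' can never appear inside a multibyte sequence. A minimal sketch (the file name is illustrative):

#include <stdio.h>
#include <string.h>

int main(void) {
    /* "naïve/文档/résumé.txt" spelled out as UTF-8 bytes */
    const char *path = "na\xC3\xAFve/\xE6\x96\x87\xE6\xA1\xA3/r\xC3\xA9sum\xC3\xA9.txt";
    const char *slash = strrchr(path, '/'); /* plain byte-wise search is safe in UTF-8 */
    printf("base name: %s\n", slash ? slash + 1 : path); /* résumé.txt */
    return 0;
}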

UTF-32, with code points directly accessible, is indeed more convenient if you need character properties like categories and case mappings.

But widechars are awkward to use on Linux for the same reason that UTF-8 is awkward to use on Windows. GNU libc has no _wfopen or _wstat function.
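A minimal sketch of that asymmetry (the file name is illustrative): on Linux the narrow byte-string API accepts UTF-8 directly, while on Windows arbitrary Unicode names require the wide variant.

#include <stdio.h>

int main(void) {
#ifdef _WIN32
    /* Windows: the narrow fopen goes through the ANSI code page, so Unicode
       file names need the wide-character variant. */
    FILE *f = _wfopen(L"r\u00E9sum\u00E9.txt", L"r");
#else
    /* Linux/glibc: file names are just bytes, so UTF-8 works with plain fopen. */
    FILE *f = fopen("r\xC3\xA9sum\xC3\xA9.txt", "r");
#endif
    if (f) fclose(f);
    return 0;
}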

– dan04
4

UTF-8, being backward compatible with ASCII, makes it possible to ignore Unicode somewhat.

Often, programs don't care (and in fact, don't need to care) what the input is, as long as there is no \0 byte that would terminate the string early. See:

#include <stdio.h>

int main(void) {
    char buf[256]; /* whatever size you like */
    printf("Your favorite pizza topping is which?\n");
    fgets(buf, sizeof(buf), stdin); /* Jalapeños */
    printf("%s it shall be.\n", buf);
}

The only times I have found I needed Unicode support are when I had to treat a multibyte character as a single unit (wchar_t); e.g. when having to count the number of characters in a string rather than bytes. iconv from UTF-8 to wchar_t will do that quickly. For bigger issues like zero-width spaces and combining diacritics, something heavier like ICU is needed, but how often do you do that anyway?
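A minimal sketch of that kind of character count, using the standard mbstowcs rather than iconv (it assumes a UTF-8 locale such as "en_US.UTF-8" is installed; the string and locale name are illustrative):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    setlocale(LC_CTYPE, "en_US.UTF-8");    /* assumed to be available on the system */
    const char *s = "Jalape\xC3\xB1os";    /* "Jalapeños" as UTF-8 */
    size_t nchars = mbstowcs(NULL, s, 0);  /* count wide chars; returns (size_t)-1 on bad input */
    printf("%zu bytes, %zu characters\n", strlen(s), nchars); /* 10 bytes, 9 characters */
    return 0;
}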

– user502515
  • 1
    More common is case-insensitive comparison. But Linux doesn't need it for filenames. – dan04 Jan 04 '11 at 01:41
  • 2
    @dan04: And case-insensitive comparison is problematic anyway, because doing it properly means depending on the locale/culture (e.g. an uppercase `i` in Turkish is *not* an `I`)... which is why the only reasonable option is to have it case-sensitive, IMO. – Tim Čas Sep 04 '16 at 20:11
3

wchar_t is not the same size on all platforms. On Windows it is a UTF-16 code unit that uses two bytes. On other platforms it typically uses 4 bytes (for UCS-4/UTF-32). It is therefore unlikely that these platforms would standardize on using wchar_t, since it would waste a lot of space.
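A minimal sketch that makes the difference visible (the printed values are what you would typically see on those platforms, not guarantees of the standard):

#include <stdio.h>
#include <wchar.h>

int main(void) {
    /* Typically 2 on Windows (UTF-16 code unit) and 4 on Linux (UTF-32/UCS-4). */
    printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
    /* The nine characters of L"wide text" therefore occupy 18 or 36 bytes. */
    printf("payload bytes: %zu\n", wcslen(L"wide text") * sizeof(wchar_t));
    return 0;
}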

– villintehaspam
  • 1
    Well, it could also be a UTF-16 surrogate pair. – Billy ONeal Jan 03 '11 at 21:05
  • 1
    Storing surrogates in `wchar_t` is not only non-conformant, but makes it impossible to implement a UTF-8 multibyte encoding or any multibyte encoding that supports non-BMP characters with the standard library `mbrtowc` function. See http://stackoverflow.com/questions/3228828/how-to-best-deal-with-windows-16-bit-wchar-t-ugliness – R.. GitHub STOP HELPING ICE Jan 04 '11 at 04:03
  • @R.: I don't see how it could be "non-conformant". What are you stipulating it should conform to? The `wcrtomb` and `mbrtowc` functions don't define the encodings on which they operate anyway so one cannot rely on their particular behavior in that respect. – Billy ONeal Jan 04 '11 at 18:17
  • 4
    ISO C Amendment 1. The character set that `wchar_t` uses is deliberately unspecified, but whatever it is, `wchar_t` needs to be large enough to represent any character. So UCS-2 and UTF-32 are acceptable `wchar_t` encodings, but UTF-16 is not. – dan04 Jan 05 '11 at 03:20
  • 1
    Why is UTF-16 unacceptable for `wchar_t`? It works fine, as long as you interpret "character" to mean codeunit and not codepoint. A UTF-16 encoded string, even one that uses surrogates, can be represented with `wchar_t`, as long as each codeunit has its own `wchar_t` element within the string. – Remy Lebeau Jan 06 '11 at 02:13
  • 5
    @Remy: Because the `mbrtowc` function *cannot behave as specified* when a single multibyte character must translate to two or more `wchar_t` values. See the question I linked. – R.. GitHub STOP HELPING ICE May 25 '11 at 23:27