10

Popular software developers and companies (Joel Spolsky, Fog Creek Software) tend to use wchar_t for Unicode character storage when writing C or C++ code. When and how should one use char and wchar_t with respect to good coding practices?

I am particularly interested in POSIX compliance when writing software that leverages Unicode.

When using wchar_t, you can look up characters in an array of wide characters on a per-character or per-array-element basis:

/* C code fragment */
const wchar_t *overlord = L"ov€rlord";
if (overlord[2] == L'€')
    wprintf(L"Character comparison on a per-character basis.\n");

How can you compare Unicode bytes (or characters) when using char?

So far my preferred way of comparing strings and characters of type char in C often looks like this:

/* C code fragment */
const char *mail[] = { "ov€rlord@masters.lt", "ov€rlord@masters.lt" };
if (mail[0][2] == mail[1][2] && mail[0][3] == mail[1][3] && mail[0][4] == mail[1][4])
    printf("%s\n%zu", *mail, strlen(*mail));

This method compares the byte sequence of a Unicode character. The Unicode Euro symbol takes up 3 bytes in UTF-8, so one needs to compare three char array elements to know whether the Unicode characters match. Often you need to know the size of the character or string you want to compare and the byte sequence it produces for this approach to work. This does not look like a good way of handling Unicode at all. Is there a better way of comparing strings and character elements of type char?

In addition, when using wchar_t, how can you read a file's contents into an array? The function fread does not seem to produce valid results.

user1254893
  • 9
    Unicode in C++: don't use `wchar_t`, use a proper Unicode library. – Cat Plus Plus Mar 18 '12 at 10:35
  • 3
    `tend to use wchar_t for Unicode character encoding`. No; they use it for Unicode character _storage_, and there is a big difference. – Lightness Races in Orbit Mar 18 '12 at 10:46
  • possible duplicate of [std::wstring VS std::string](http://stackoverflow.com/questions/402283/stdwstring-vs-stdstring) – 一二三 Mar 18 '12 at 10:46
  • 2
    @LightnessRacesinOrbit: Unfortunately, the C/C++ standards do not require `wchar_t` to be capable of storing Unicode characters, and do not specify how you would figure out the encoding if it does store Unicode characters. – Dietrich Epp Mar 18 '12 at 12:25
  • 4
    The reason Joel Spolsky can use `wchar_t` is because he's not writing portable code and not targeting POSIX: he's assuming that `wchar_t` is UCS-2, which is how it works with "Visual Basic, COM, and Windows NT/2000/XP." Not only is UCS-2 obsolete, the `wchar_t` type is nearly useless on POSIX systems. Few libraries use it, so the only thing you'll ever do with `wchar_t` is turn it into something else (probably UTF-16 or UTF-8). You can't manipulate it easily because you can't portably assume how it's encoded. It's like some big joke the standards committees perpetrated. – Dietrich Epp Mar 18 '12 at 12:33
  • 3
    @DietrichEpp: A joke indeed. `wchar_t` is essentially a second-class citizen. For example, see http://stackoverflow.com/questions/3693479/why-does-c-not-have-an-snwprintf-function – jamesdlin Mar 18 '12 at 13:31
  • Also, nobody is really using UTF-16. The obvious target group for UTF-16 (people who speak CJK, because UTF-16 is more compact for those character ranges than UTF-8, UTF-32 or schemes like SCSU, BOCU) use different encodings entirely! – Mr Lister Mar 18 '12 at 13:55
  • 2
    @DietrichEpp: Quite. It's not particularly _good_ at storing "Unicode" characters, but that is what people "tend" to use it for nonetheless :) – Lightness Races in Orbit Mar 18 '12 at 20:14
  • @Dietrich: wchar_t in Windows stores UTF-16. wchar_t on some (most) *nix platforms stores UTF-32. So no UCS-2, and not obsolete. – Mihai Nita Mar 22 '12 at 08:57
  • @Mr Lister: actually, pretty much everybody uses UTF-16: Mac OS X, Windows, KDE, Qt, Java. The only areas using UTF-8 internally are some of the Linux/*nix C runtimes (and in many cases without "knowing" it is UTF-8, just moving bytes around). – Mihai Nita Mar 22 '12 at 09:00
  • So, wchar_t is not a joke. It is indeed a second class citizen because the standard C/C++ runtime library obstinately refuses to acknowledge that all the APIs operating on char_t need equivalent APIs on wchar_t (or uchar16_t, or uchar32_t, or anything other than "bunch of bytes"). – Mihai Nita Mar 22 '12 at 09:04
  • 2
    @MihaiNita: This is not a discussion about UTF-16, this is a discussion about `wchar_t`. You are right that those APIs use UTF-16, but only on Windows is that the same thing as `wchar_t`; so you can't do much with `wchar_t` on Mac OS X because it's UTF-32. Kind of disingenuous to mention KDE and dismiss UTF-8 as irrelevant considering it's used for major parts of many Linux desktops, e.g., Pango and Gtk. The joke about `wchar_t` is that it's supposed to be portable but it's actually less portable than, say, `unsigned short`. At least you know `unsigned short` is 16 bits on POSIX. – Dietrich Epp Mar 22 '12 at 09:31
  • 1
    @MihaiNita: Java is also irrelevant, since we are talking about `wchar_t` which is part of the C and C++ standards. Java does not have a `wchar_t` type. – Dietrich Epp Mar 22 '12 at 09:33
  • There are plenty of myths on this subject. As others said, those "Popular software developers" aren't targeting anything but Windows. Once you need to write portable code, using [UTF-8 encoded narrow string everywhere](http://utf8everywhere.org/) is the sanest way to go. – Yakov Galka May 02 '12 at 08:18

3 Answers

10

If you know that you're dealing with Unicode, neither char nor wchar_t is appropriate, as their sizes are compiler/platform-defined. For example, wchar_t is 2 bytes on Windows (MSVC) but 4 bytes on Linux (GCC). The C11 and C++11 standards are a bit more rigorous and define two new character types (char16_t and char32_t) with associated literal prefixes for creating UTF-{8, 16, 32} strings.
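
For illustration, a minimal C11 sketch of those types and literal prefixes (this assumes a compiler with C11 Unicode literal support and a UTF-8 source encoding; the string is borrowed from the question):

/* C11 code fragment: new character types and literal prefixes */
#include <stdio.h>
#include <uchar.h>

int main(void)
{
    const char     *s8  = u8"ov€rlord";  /* UTF-8  string, array of char     */
    const char16_t *s16 = u"ov€rlord";   /* UTF-16 string, array of char16_t */
    const char32_t *s32 = U"ov€rlord";   /* UTF-32 string, array of char32_t */

    /* In UTF-32 each code point is exactly one array element, so the
       per-character indexing from the question works reliably here: */
    if (s32[2] == U'€')
        printf("%s\n", s8);              /* prints the UTF-8 form */

    (void)s16;                           /* unused in this sketch */
    return 0;
}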

If you need to store and manipulate Unicode characters, you should use a library that is designed for the job, as neither the pre-C11 nor pre-C++11 language standards were written with Unicode in mind. There are a few to choose from, but ICU is quite popular (and supports C, C++, and Java).
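
As a rough sketch only, comparing two UTF-8 strings with ICU's C API (ICU4C) might look like the fragment below; the fixed buffer sizes and minimal error handling are simplifications:

/* C code fragment: ICU4C, header <unicode/ustring.h>, link with -licuuc */
#include <unicode/ustring.h>
#include <stdio.h>

int main(void)
{
    const char *a = "ov€rlord@masters.lt";
    const char *b = "ov€rlord@masters.lt";

    UChar ua[64], ub[64];                 /* ICU works internally in UTF-16 */
    UErrorCode status = U_ZERO_ERROR;

    /* Convert the UTF-8 input to UTF-16, then compare code unit by code unit. */
    u_strFromUTF8(ua, 64, NULL, a, -1, &status);
    u_strFromUTF8(ub, 64, NULL, b, -1, &status);

    if (U_SUCCESS(status) && u_strcmp(ua, ub) == 0)
        puts("The strings contain the same code points.");
    return 0;
}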

一二三
  • 3
    Even C++11 is quite light on the unicode stuff. Beyond mandating a few types and standard conversions between utf8/16/32 you won't find anything like collation, comparison, normalization, etc. – edA-qa mort-ora-y Mar 18 '12 at 11:06
  • Just as an addition, I think C11 here tries to be in sync with C++11 and introduces the same new `char??_t` types. – Jens Gustedt Mar 18 '12 at 11:18
  • Yes, C11 is in sync with C++11 for these types/literals. – 一二三 Mar 18 '12 at 11:22
0

I am particularly interested in POSIX compliance when writing software that leverages Unicode.

In this case, you'll probably want to use UTF-8 (with char) as your preferred Unicode string type. POSIX doesn't have a lot of functions for working with wchar_t — that's mostly a Windows thing.
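
A minimal sketch of that approach, reading a file's contents into a plain char buffer with fread and treating the bytes as UTF-8 (the file name and buffer size are placeholders):

/* C code fragment: reading a UTF-8 file with fread into a char buffer */
#include <stdio.h>

int main(void)
{
    char buf[4096];
    FILE *fp = fopen("input.txt", "rb");
    if (fp == NULL)
        return 1;

    size_t n = fread(buf, 1, sizeof buf - 1, fp);
    fclose(fp);
    buf[n] = '\0';                 /* the bytes are already UTF-8 text */

    printf("%s\n", buf);           /* a UTF-8 locale/terminal renders it correctly */
    return 0;
}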

This method scans for the byte equivalent of a unicode character. The Unicode Euro symbol € takes up 3 bytes. Therefore one needs to compare three char array bytes to know if the Unicode characters match. Often you need to know the size of the character or string you want to compare and the bits it produces for the solution to work.

No, you don't. You just compare the bytes. Iff the bytes match, the strings match. strcmp works just as well with UTF-8 as it does with any other encoding.
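
For example, the comparison from the question collapses to a single strcmp, assuming both strings are valid UTF-8:

/* C code fragment */
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *mail[] = { "ov€rlord@masters.lt", "ov€rlord@masters.lt" };

    /* strcmp compares byte by byte; since each code point has exactly one
       valid UTF-8 encoding, equal bytes mean equal strings. */
    if (strcmp(mail[0], mail[1]) == 0)
        printf("%s\n%zu\n", mail[0], strlen(mail[0]));  /* strlen counts bytes, not characters */
    return 0;
}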

Unless you want something like a case-insensitive or accent-insensitive comparison, in which case you'll need a proper Unicode library.
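
For instance, a rough sketch of a case-insensitive comparison with ICU's C API (u_strcasecmp performs Unicode case folding):

/* C code fragment: ICU4C case-insensitive comparison */
#include <unicode/ustring.h>
#include <unicode/uchar.h>     /* U_FOLD_CASE_DEFAULT */
#include <stdio.h>

int main(void)
{
    UChar a[32], b[32];
    UErrorCode status = U_ZERO_ERROR;

    /* Convert the UTF-8 input to ICU's UTF-16 representation. */
    u_strFromUTF8(a, 32, NULL, "OV€RLORD", -1, &status);
    u_strFromUTF8(b, 32, NULL, "ov€rlord", -1, &status);

    /* Unicode case folding makes the two strings compare equal. */
    if (U_SUCCESS(status) && u_strcasecmp(a, b, U_FOLD_CASE_DEFAULT) == 0)
        puts("Equal under case-insensitive comparison.");
    return 0;
}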

dan04
0

You should never-ever compare bytes, or even code points, to decide whether strings are equal. That's because a lot of strings can be identical from the user's perspective without being identical from the code point perspective.
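
A small sketch of why: both byte sequences below render as "é" (precomposed U+00E9 versus "e" plus combining U+0301), yet a byte-wise strcmp reports them as different:

/* C code fragment: canonically equivalent strings with different bytes */
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *precomposed = "\xC3\xA9";   /* U+00E9, "é" as one code point       */
    const char *decomposed  = "e\xCC\x81";  /* U+0065 followed by combining U+0301 */

    /* The byte sequences differ, so strcmp reports a mismatch, although a
       user would consider the two strings identical. */
    printf("strcmp: %d\n", strcmp(precomposed, decomposed));

    /* To treat them as equal, normalize first (e.g. with ICU) or use a
       collation-aware comparison. */
    return 0;
}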

Mihai Nita