
So I've finally gotten back to my main task - porting a rather large C++ project from Windows to the Mac.

Straight away I've been hit by the problem where wchar_t is 16 bits on Windows but 32 bits on the Mac. This is a problem because all of the strings are represented by wchar_t and there will be string data going back and forth between Windows and Mac machines (both on-disk data and network data). Because of the way the code works, it wouldn't be totally straightforward to convert the strings into some common format before sending and receiving the data.
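To make the mismatch concrete, here's a minimal check; the sizes shown assume typical MSVC and GCC/Clang toolchains:

```cpp
#include <cstdio>
#include <cwchar>

int main() {
    // Typically prints 2 on Windows (MSVC) and 4 on Mac/Linux (GCC/Clang).
    std::printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));

    // Dumping the raw bytes of a wchar_t string to disk or a socket therefore
    // produces a different byte stream on each platform: the data is not
    // portable without converting to a fixed, byte-defined encoding first.
    const wchar_t* s = L"hello";
    std::printf("L\"hello\" occupies %zu bytes in memory\n",
                (std::wcslen(s) + 1) * sizeof(wchar_t));
}
```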

We've also really started to support a lot more languages recently and so we're starting to deal with a lot of Unicode data (as well as dealing with right-to-left languages).

Now, I could be conflating multiple ideas here and causing more problems for myself than needed, which is why I'm asking this question. We're thinking that storing all of our in-memory string data as UTF-8 makes a lot of sense. It solves the problem of wchar_t being different sizes, it means we can easily support multiple languages, and it dramatically reduces our memory footprint (we have a LOT of - mostly English - strings loaded) - but it doesn't seem like many people are doing this. Is there something we're missing? There's the obvious problem to deal with: the number of characters in a string can be smaller than the number of bytes used to store it.
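To illustrate that last point, a small sketch: `std::string::size()` reports bytes, while counting code points means walking the UTF-8 sequence (the counting helper below is illustrative, not from any particular library):

```cpp
#include <iostream>
#include <string>

// Count Unicode code points in a UTF-8 string by skipping continuation
// bytes (those of the form 10xxxxxx). This is the length/size mismatch
// mentioned above: size() reports bytes, not characters.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)  // not a continuation byte
            ++count;
    return count;
}

int main() {
    std::string s = "\xC3\xA9t\xC3\xA9";  // "été" encoded as UTF-8
    std::cout << s.size() << " bytes, "                  // prints 5
              << utf8_code_points(s) << " code points\n";  // prints 3
}
```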

Or is using UTF-16 a better idea? Or should we stick with wchar_t and write code to convert between wchar_t and, say, a fixed Unicode encoding in the places where we read/write to the disk or the network?

I realize this is dangerously close to asking for opinions - but we're nervous that we're overlooking something obvious, because it doesn't seem like there are many Unicode string classes (for example) - and yet there's plenty of code for converting to/from Unicode, like in boost::locale, iconv, utf-cpp and ICU.

user438380

4 Answers


Always use a protocol defined to the byte when a file or network connection is involved. Do not rely on how a C++ compiler stores anything in memory. For Unicode text, this means choosing both an encoding and a byte order (okay, UTF-8 doesn't care about byte order). Even if the platforms you currently want to support have similar architectures, another popular platform with different behavior or even a new OS for one of your existing platforms will likely come along, and you'll be glad you wrote portable code.
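As a sketch of what "defined to the byte" means in practice, here's a hypothetical wire format (the layout is illustrative, not a standard): a 4-byte little-endian length followed by that many UTF-8 bytes, which decodes identically regardless of compiler or CPU:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical wire format: a 4-byte little-endian length, then that many
// bytes of UTF-8. Every field is defined to the byte, so the stream reads
// back identically regardless of the compiler's wchar_t size or the CPU's
// endianness.
std::vector<std::uint8_t> encode(const std::string& utf8) {
    std::vector<std::uint8_t> out;
    std::uint32_t n = static_cast<std::uint32_t>(utf8.size());
    out.push_back(static_cast<std::uint8_t>(n));        // byte 0 (LSB)
    out.push_back(static_cast<std::uint8_t>(n >> 8));   // byte 1
    out.push_back(static_cast<std::uint8_t>(n >> 16));  // byte 2
    out.push_back(static_cast<std::uint8_t>(n >> 24));  // byte 3 (MSB)
    out.insert(out.end(), utf8.begin(), utf8.end());    // payload bytes
    return out;
}
```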

aschepler

I tend to use UTF-8 as the internal representation. You only lose string length checking, which isn't really useful anyway. For Windows API conversion, I use my own Win32 conversion functions, which I devised here (a sketch of that kind of helper follows the list below). As Mac and Linux are (for the most part) natively UTF-8-aware, there's no need to convert anything there. Free bonuses you get:

  1. Use plain old std::string.
  2. Byte-wise network/stream transport.
  3. For most languages, a nice memory footprint.
  4. For more functionality: utf8cpp.
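For reference, a minimal sketch of what such Win32 conversion helpers usually look like, built on the documented MultiByteToWideChar/WideCharToMultiByte APIs (the function names here are illustrative, not the answer's own, and error handling is omitted):

```cpp
#include <string>
#include <windows.h>

// UTF-8 (internal representation) -> UTF-16 for a Win32 API call.
std::wstring widen(const std::string& utf8) {
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                  static_cast<int>(utf8.size()), nullptr, 0);
    std::wstring out(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        static_cast<int>(utf8.size()), &out[0], len);
    return out;
}

// And back: UTF-16 from a Win32 API result into the internal UTF-8 form.
std::string narrow(const std::wstring& utf16) {
    int len = WideCharToMultiByte(CP_UTF8, 0, utf16.data(),
                                  static_cast<int>(utf16.size()),
                                  nullptr, 0, nullptr, nullptr);
    std::string out(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, utf16.data(),
                        static_cast<int>(utf16.size()),
                        &out[0], len, nullptr, nullptr);
    return out;
}
```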
rubenvb
  • UTF-8 does **not** allow you to use "plain old `std::string`". Maybe if all you want to do is store the string that's okay, but you cannot actually modify the string in that form without writing your own UTF-8 processing garbage if you use that container. (i.e. you cannot use member functions like `std::string::find` and expect them to work correctly with UTF-8 strings) Too many people think "Oh, I'll just use UTF-8" and think they can just continue treating everything like character arrays, which is false. – Billy ONeal Nov 14 '10 at 06:48
  • @Billy: That is true for any multibyte encoding. std::string is a container of chars, not glyphs, and it is perfectly fine to keep UTF-8 encoded text in std::string and process it with something like utf8cpp – Nemanja Trifunovic Nov 15 '10 at 02:36
  • @Nemanja: Yes, it's fine to use a std::string for storage, but you could technically *store* anything in a std::string (so long as you could provide a dummy `std::char_traits` facet for it). However, when you say "You can use plain old std::string", people are going to assume they can actually use the class for anything other than data storage. If **just storage** is what you're after, then you should probably use `vector` instead. – Billy ONeal Nov 15 '10 at 16:11
  • @BillyONeal In fact you can use `string::find` with UTF-8 as long as you're using it to find a specific sequence of code points and not 'equivalent' strings (e.g., composed vs. decomposed sequences) – bames53 Nov 30 '11 at 16:13
  • @bames53: You can use it to find a code point. You cannot use it to find a character. Composed characters are composed of multiple code points, and the same character may be represented by a large number of possible forms. Storing UTF-8 in a `std::string` is just a bad idea -- it promotes bad habits because many of `std::string`'s members are not unicode safe. – Billy ONeal Nov 30 '11 at 16:26
  • @BillyONeal Yes, that's what I said: it works for finding code points. Those same string methods that aren't safe for UTF-8 strings aren't safe for _any_ multi-byte character set. But platforms do use multi-byte character sets for their char encoding, whether it's a legacy encoding on Windows like Shift-JIS or UTF-8 as on most Unix platforms. – bames53 Nov 30 '11 at 17:50
  • @BillyONeal The problem of characters being possibly composed of multiple code points is not specific to UTF-8 in `std::string`. It applies equally to UTF-32 in `std::u32string`. As far as I know no significant API has ever bothered to create a string class that prevents, for example, accidentally splitting a code point sequence that makes up one glyph. Furthermore Unicode safe algorithms can be implemented for UTF-8 as easily as for any Unicode encoding. So I can see no special risk to using UTF-8 and no special benefit to using anything else. – bames53 Nov 30 '11 at 17:56
  • @bames53: It's quite easy to write a string class that exposes its contents in terms of glyphs. As I said, you could use `std::string` as the method for storing things in that container, but doing so doesn't get you anything more than you get from `vector`. – Billy ONeal Nov 30 '11 at 20:16
  • @BillyONeal Yes, I know that can be done, and done as easily for UTF-8 as any other Unicode encoding. What I'm disagreeing with is that there's some special drawback to using UTF-8 in std::string. You haven't described your preferred alternative that presumably solves this, but the drawbacks you described apply to wchar_t*, std::wstring, char16/32_t*, std::u16/32string, C#'s String, MFC's CString, ICU's UnicodeString, NSString, and pretty much everything else as far as I can tell. – bames53 Nov 30 '11 at 21:45
  • I still see which options are *wrong*, but less about how to do it *right*. And I'd like to know not only about *storing* UTF-8 strings, but also *comparing* them, searching for particular characters, and classifying Unicode characters in them (like the plain old `isalpha`, `isdigit` etc. have been doing). Is it possible at all in C++? – SasQ Mar 29 '13 at 00:35
  • @SasQ not in plain C++. You'll need something like ICU for that. Remember you're opening a whole can of worms here, including locale, language, and normalisation issues. Unicode is not simple, ever. – rubenvb Mar 29 '13 at 08:02
  • @rubenvb in visual studio, when I do `std::string msg = "महसुस"`, I cannot view it. And everything is replaced by question mark. Any idea? – Pritesh Acharya Apr 24 '14 at 05:40
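To make the thread's point concrete: byte-level search does find an exact code unit sequence, but canonically equivalent spellings don't compare equal without Unicode normalization (the literals below spell "café" two ways):

```cpp
#include <iostream>
#include <string>

int main() {
    // "café" with é as the precomposed code point U+00E9 (0xC3 0xA9 in UTF-8)
    std::string precomposed = "caf\xC3\xA9";
    // "café" with é as 'e' plus combining acute accent U+0301 (0xCC 0x81)
    std::string decomposed = "cafe\xCC\x81";

    // find() matches exact byte (code unit) sequences, so this succeeds:
    std::cout << (precomposed.find("\xC3\xA9") != std::string::npos) << '\n';  // 1

    // ...but the two canonically equivalent spellings don't match at the
    // byte level; that requires normalization (e.g. via ICU):
    std::cout << (precomposed == decomposed) << '\n';  // 0
}
```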

As a rule of thumb: UTF-16 for processing, UTF-8 for communication & storage.

Sure, any rule can be broken and this one is not carved in stone. But you have to know when it is ok to break it.

For instance, it might be a good idea to use something else if the environment you are working in wants something else. But the Mac OS X APIs use UTF-16, the same as Windows, so UTF-16 makes more sense there. It is also more straightforward to convert just before you put/get things on the net (because you probably do that in two or three routines) than to do a conversion for every OS API call.
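A sketch of that boundary conversion, using utf-cpp (one of the libraries mentioned in the question; `std::u16string` here stands in for whatever UTF-16 type the code uses internally):

```cpp
#include <iterator>
#include <string>
#include "utf8.h"  // utf-cpp

// Keep UTF-16 in memory (matching the Windows / Mac OS X API preference)
// and convert only in the few routines that touch the disk or network.
std::string to_wire(const std::u16string& text) {
    std::string utf8;
    utf8::utf16to8(text.begin(), text.end(), std::back_inserter(utf8));
    return utf8;
}

std::u16string from_wire(const std::string& utf8) {
    std::u16string text;
    utf8::utf8to16(utf8.begin(), utf8.end(), std::back_inserter(text));
    return text;
}
```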

The type of application you are developing also matters. If it does very little text processing and makes very few calls to the system (something like an email server that mostly moves things around without changing them), then UTF-8 might be a good choice.

So, as much as you might hate this answer, "it depends".

Mihai Nita

ICU has a C++ string class, UnicodeString.
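A small sketch of UnicodeString round-tripping through UTF-8 (assuming ICU is installed and you link against icuuc):

```cpp
#include <iostream>
#include <string>
#include <unicode/unistr.h>  // icu::UnicodeString

int main() {
    // UnicodeString stores UTF-16 internally; fromUTF8/toUTF8String convert
    // at the boundaries. This literal is "Grüße" spelled out in UTF-8 bytes.
    icu::UnicodeString ustr =
        icu::UnicodeString::fromUTF8("Gr\xC3\xBC\xC3\x9F" "e");
    ustr.toLower();  // locale-sensitive operations are where ICU earns its size

    std::string utf8;
    ustr.toUTF8String(utf8);   // back to UTF-8 for storage or transport
    std::cout << utf8 << '\n';  // prints "grüße" on a UTF-8 terminal

    // U+1D11E (a musical clef) needs a surrogate pair in UTF-16, so the
    // UTF-16 code unit count and the code point count differ:
    icu::UnicodeString clef = icu::UnicodeString::fromUTF8("\xF0\x9D\x84\x9E");
    std::cout << clef.length() << " code units, "         // 2
              << clef.countChar32() << " code point\n";   // 1
}
```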

Steven R. Loomis
  • ICU is a nice library for this kind of stuff. Unfortunately it's also **huge** (Compiled size of ICU is some 25MB). That may be okay in some cases, but it's (of course) not okay in others. Some people don't actually need all the features it provides. OTOH, anyone implementing what it does themselves usually gets it wrong (things like collation are different per locale, and ICU handles that stuff correctly) – Billy ONeal Nov 14 '10 at 06:50
  • A lot of that is data for 500 locales and hundreds of converters, and all possible libraries. It's pretty easily customizable from the data and code point of view, if you don't need everything. The core icuuc library for example is about 1.4MB not including data. – Steven R. Loomis Nov 15 '10 at 16:24