
I need to write an app for an embedded device using C++. I may need to support Unicode too (though I am not an expert on it). I had a look at Joel Spolsky's article about Unicode: http://www.joelonsoftware.com/articles/Unicode.html

My question is: given what I mentioned above, what is the way to go with Unicode in such an application in C++? Should I use wchar_t everywhere, or std::wstring?

What problems might I encounter in using wchar_t all the time? (This post mentions some problems one might encounter with Unicode strings: Switching from std::string to std::wstring for embedded applications? - but I am still confused and don't know exactly what to do.)

pseudonym_127
  • See http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful – dalle May 16 '13 at 09:14
  • @dalle: I consider both the linked question and its "accepted" answer to be severely misguided. None of the problems mentioned are inherent to UTF-16, they are inherent to multibyte encodings and applications written in ignorance of multibyte implications. Using UTF-8 instead doesn't really solve the problems, and using UTF-32 still doesn't solve the issue of e.g. combining characters. You want to go beyond ISO-8859, you have to *understand* Unicode, multibyte, *and* the limits of wide characters. No way around it. – DevSolar May 16 '13 at 09:25
  • 1
    What do you need to *do* with your Unicode strings? Once you start looking at individual characters, things get tricky and you'll need a library with robust Unicode support to do all your string manipulation, but if you just need to store (and maybe concatenate) valid Unicode strings, then you should be fairly safe. – jalf May 16 '13 at 09:41
  • @jalf: "What do you need to do with your Unicode strings?" --> Yes, I am not sure yet exactly what I need to do with them. – pseudonym_127 May 16 '13 at 09:47
  • @DevSolar: I just wanted to point out that `wchar_t` and `std::wstring` aren't needed to *support* Unicode. I'm sure that using UTF-8 (instead of UTF-16) will on the other hand force developers to think of Code Units much earlier, and not lead them into thinking that a `wchar_t` is a Character or a Code Point. I'm sure of this because it is very likely that they will encounter non-ASCII characters far more often than non-BMP characters. And I'm hoping that using UTF-8 will in turn make the developer think even further about the complexity of Unicode. – dalle May 17 '13 at 18:03
  • @dalle: Correct, `wchar_t` and `std::wstring` are not needed - they are woefully inadequate. "Support" is more than just storing data: Conversion, collation, searching... – DevSolar May 17 '13 at 18:09

2 Answers


"Supporting" Unicode goes well beyond using wchar_t or std::wstring (which are merely "types suitable for some wide-character encoding which might or might not be actually Unicode depending on current locale and platform").

Think of things like isalpha(), tokenizing, converting to / from different encodings etc., and you get the idea.
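For instance, here is a minimal sketch of how the byte-oriented classic C functions fall short (the byte values are simply the UTF-8 encoding of 'é'):

```cpp
#include <cctype>
#include <cstdio>
#include <string>

int main() {
    std::string s = "\xC3\xA9";  // U+00E9 ('é') encoded as two UTF-8 bytes

    // std::isalpha() classifies single bytes, never whole characters;
    // neither byte of 'é' counts as alphabetic in the default "C" locale.
    for (unsigned char c : s)
        std::printf("byte 0x%02X -> isalpha: %d\n", (unsigned)c,
                    std::isalpha(c) != 0);
}
```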

Unless you know you can get away with built-in stuff like wchar_t / std::wstring (and you wouldn't be asking in that case), you are better off using the ICU library, which is the state-of-the-art implementation for Unicode support. (Even the otherwise-recommendable Boost.Locale relies on ICU to provide the actual logic.)

The C way of doing Unicode in ICU is arrays of type UChar [] (UTF-16); the C++ way is the class icu::UnicodeString. I happen to work with a legacy codebase that goes to great lengths to "make do" with UChar [] for claimed performance reasons (shared references, memory pooling, copy-on-write etc.), but still fails to outperform icu::UnicodeString, so you may feel safe in using the latter even in an embedded environment. They did a good job there.
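For a flavour of the C++ API, a minimal sketch (assuming the ICU headers are installed and you link against the common library, e.g. -licuuc):

```cpp
#include <unicode/unistr.h>
#include <iostream>
#include <string>

int main() {
    // Build from UTF-8 input; UnicodeString stores UTF-16 internally.
    // The bytes are "naïve " followed by U+1F600 (a non-BMP emoji).
    icu::UnicodeString s =
        icu::UnicodeString::fromUTF8("na\xC3\xAFve \xF0\x9F\x98\x80");

    std::cout << "UTF-16 code units: " << s.length()      << "\n"; // 8
    std::cout << "code points:       " << s.countChar32() << "\n"; // 7

    // Convert back to UTF-8 for I/O or storage.
    std::string utf8;
    s.toUTF8String(utf8);
    std::cout << utf8 << "\n";
}
```

Note how the two counts differ as soon as a non-BMP character is involved - that is the "UTF-16 is still multibyte" issue mentioned below.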

Post scriptum: Take note that wchar_t is of implementation-defined width: 32 bits on the Unixes I know of, and 16 bits on Windows - which causes additional trouble, since wchar_t is supposed to be "wide", but UTF-16 is still "multibyte" as far as Unicode is concerned. If you can rely on the environment supporting C++11, char16_t or char32_t would be better choices, though they are still agnostic of finer points like combining characters.
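A quick way to check this on any given toolchain (a minimal sketch; the wchar_t number depends on your platform, as described above):

```cpp
#include <cstdio>

int main() {
    // wchar_t width is implementation-defined: typically 4 bytes on
    // Unix-likes, 2 on Windows. char16_t / char32_t (C++11) are fixed.
    std::printf("wchar_t:  %zu bytes\n", sizeof(wchar_t));
    std::printf("char16_t: %zu bytes\n", sizeof(char16_t));
    std::printf("char32_t: %zu bytes\n", sizeof(char32_t));
}
```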

DevSolar
  • "performance ... Copy On Write". It turns out COW is _not_ good for performance on modern systems. The savings just don't justify the costs. – MSalters May 16 '13 at 09:50
  • @MSalters: Not to mention the amount of custom code, complete with memory leaks, lack of suitable documentation etc. -- ignoring existing implementations because you think you can do better "rolling your own" is a recipe for disaster 9 times out of 10. Funny, I have never come across the 10th time yet in all those years. ;-) – DevSolar May 16 '13 at 09:56

You've read Joel's article, but it seems you have not understood it. std::wstring or strings of wchar_t are not Unicode; they are wide-character strings that may contain UCS-2 or UTF-16 Unicode strings, or something else. std::string may contain plain ASCII, ANSI strings in some codepage, UTF-8 Unicode strings, or something else.

Both of these occur often: std::wstring tends to hold UTF-16 on Windows, and std::string tends to hold UTF-8 on POSIX systems.
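A small sketch of this (using C++11 literals, and assuming a pre-C++20 compiler, where u8 literals are plain char): the very same character occupies a different number of code units depending on the encoding it is stored in:

```cpp
#include <iostream>
#include <string>

int main() {
    // U+00E9 ('é'): two code units in UTF-8, one in UTF-16.
    std::string    utf8  = u8"\u00e9";
    std::u16string utf16 = u"\u00e9";

    std::cout << "UTF-8 bytes:       " << utf8.size()  << "\n"; // 2
    std::cout << "UTF-16 code units: " << utf16.size() << "\n"; // 1
}
```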

DevSolar's advice is sound - have a look at ICU instead, it'll save you from an awful lot of headache and misunderstanding.

Joris Timmermans