What C++ string classes/systems exist that have good unicode support and a decent interface?

Question

Using strings in C++ development is always a bit more complicated than in languages like Java or scripting languages. I think some of the complexity comes from a performance focus in C++ and some is just historical.

I know of the following major string systems and would like to find out if there are others and what specific drawbacks they have vs. each other:

ICU : http://userguide.icu-project.org/strings#TOC-Using-Unicode-Strings-in-C-
Glib::ustring : http://library.gnome.org/devel/gtkmm-tutorial/unstable/sec-basics-ustring.html.en
MFC CString : http://msdn.microsoft.com/en-us/library/5bzxfsea%28VS.100%29.aspx
std::basic_string : http://en.cppreference.com/w/cpp/string/basic_string
Qt QString : http://doc.qt.nokia.com/4.6/qstring.html#details

I'll admit that there can be no definitive answer, but I think SO's voting system is uniquely suited to show the preferences (and thus the validity of the arguments) of people actually using a certain string system.

Added from answers:

UTF8-CPP : http://utfcpp.sourceforge.net/

If you want votes, make this community wiki. – Jul 29 '10 at 08:33
@Neil : Have done so. Makes sense. – Martin Ba Jul 30 '10 at 12:26

score 4 · Answer 1 · answered Jul 29 '10 at 08:44

You should have a look at UTF8-CPP: UTF-8 with C++ in a Portable Way

It is very lean and has a really neat C++ interface, using the standard std::string as container for the string data, thus avoiding lots of casts for other-than-unicode operations, and providing simple additional functions for unicode handling.
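
To illustrate the kind of work such a library layers on top of std::string, here is a minimal hand-written sketch that decodes UTF-8 bytes held in a std::string into code points. The name decode_utf8 is made up for this example and is not part of UTF8-CPP, and the sketch skips checks (overlong forms, surrogates) that a real library would be expected to make.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (NOT the UTF8-CPP API): decode a UTF-8 encoded
// std::string into a sequence of Unicode code points.
// Assumption: only structural validation is done; overlong encodings and
// surrogate code points are not rejected.
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        // The lead byte tells us how many bytes the sequence occupies.
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > s.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

The data stays in an ordinary std::string, so all byte-oriented code (I/O, concatenation, hashing) keeps working unchanged; only the operations that care about code points need a helper like this.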

Didier Trosset

This is the best drop-in solution I've found. It integrates beautifully with existing C++ tools and idioms, and provides unchecked operations for improved performance. – Jon Purdy Aug 05 '10 at 01:27

score 4 · Accepted Answer · edited May 23 '17 at 10:32

"Using strings in C++ development is always a bit more complicated than in languages like Java or scripting languages. I think some of the complexity comes from a performance focus in C++ and some is just historical."

I'd say it's all historical. In particular, two pieces of history:

C was developed back in the days when everyone (even Japan) was using a 7-bit or 8-bit character encoding. Because of this, the concepts of char and "byte" are hopelessly confounded.
C++ programmers quickly recognized the desirability of having a string class rather than just raw char*. Unfortunately, they had to wait 15 years for one to be officially standardized. In the meantime, people wrote their own string classes that we're still stuck with today.

Anyhow, I've used two of the classes you mentioned:

MFC CString

MSDN documentation

There are actually two CString classes: CStringA uses char with "ANSI" encoding, and CStringW uses wchar_t with UTF-16 encoding. CString is a typedef of one of them depending on a preprocessor macro. (Lots of things in Windows come in "ANSI" and "Unicode" versions.)

You could use UTF-8 for the char-based version, but this has the problem that Microsoft refuses to support "UTF-8" as an ANSI code page. Thus, functions like Trim(const char* pszTargets), which depend on being able to recognize character boundaries, won't work correctly if you use them with non-ASCII characters.
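
The failure mode can be sketched without any Windows code. The example below is an analogy, not CString's actual implementation: trim_right_bytes is a hypothetical helper that treats every byte as one character, which is assumed here to approximate what happens when the library's idea of character boundaries does not match UTF-8.

```cpp
#include <string>

// Hypothetical byte-oriented trim (an analogy, not MFC code): removes
// trailing bytes that occur anywhere in `targets`, treating each char
// as a complete character.
std::string trim_right_bytes(const std::string& s, const std::string& targets) {
    const std::string::size_type last = s.find_last_not_of(targets);
    return last == std::string::npos ? std::string() : s.substr(0, last + 1);
}
```

Trimming "é" (bytes C3 A9 in UTF-8) from "a©" (bytes 61 C2 A9) with this helper removes the trailing A9 byte, because it matches the second byte of "é", and leaves the orphaned lead byte C2 behind. The result is no longer valid UTF-8, even though "é" never appeared in the input.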

Since UTF-16 is natively supported, you'll probably prefer the wchar_t-based version.

Both CString classes have a fairly convenient interface, including a printf-like Format function. You can even pass CString objects to this varargs function, due to the way the class is implemented.

The main disadvantages are:

Slow performance for very large strings. (Last I checked, anyway.)
Lack of integration with the C++ standard library. No iterators, not even << and >> for streams.
It's Windows-only.

(That last point has caused me much frustration since I got put in charge of porting our code to Linux. Our company wrote our own string class that's a clone of CString but cross-platform.)

std::basic_string

The good thing about basic_string is that it's the standard.

The bad thing about it is that it doesn't have Unicode support. OTOH, it doesn't actively get in the way of Unicode either, as it lacks member functions like upper() / lower() that would depend on the character encoding. In that sense, it's really more of a "dynamic array of code units" than a "string".
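
A small sketch of the "array of code units" point: the three strings below all hold the same single character, U+1F600, yet size() reports a different number for each, because it counts code units rather than characters. The variable names are only for illustration, and the UTF-8 bytes are written out by hand to avoid depending on the source file's encoding.

```cpp
#include <string>

// The same single character, U+1F600, in three encodings.
// size() returns the number of code units, not the number of characters.
const std::string    utf8  = "\xF0\x9F\x98\x80";  // four 8-bit code units
const std::u16string utf16 = u"\U0001F600";       // two 16-bit code units (a surrogate pair)
const std::u32string utf32 = U"\U0001F600";       // one 32-bit code unit
```

Here utf8.size() is 4, utf16.size() is 2, and utf32.size() is 1. Operations such as substr() or operator[] work on those units too, so they can split a character in the first two cases.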

There are libraries that let you use std::string with UTF-8, such as the above-mentioned UTF8-CPP and some of the functions in the Poco library.

For which size characters to use, see std::wstring vs std::string.

score 1 · Answer 3 · answered Jul 29 '10 at 08:45

Some random thoughts:

std::basic_string: No unicode support at all, not really usable for platform-independent applications. If your code is intended for a specific platform, you can usually use std::wstring (Windows, UTF-16) or std::string (Unix-like systems, UTF-8) for storing Unicode strings, but everything else (encodings, character properties, Unicode algorithms...) is completely absent.
ICU: Idiosyncratic interface that doesn't blend well with STL algorithms (e.g., a Java-style iterator). Apart from that, ICU seems to be an industry standard and is quite extensive. Uses UTF-16 mainly, but supports other encodings.
Qt: Nice interface that is both practical and STL compatible. Uses UTF-16 internally. Would probably be my first choice if I had to write platform-independent applications in C++.
GLib, MFC: Don't know about those.
Platform-dependent facilities: For very basic tasks (e.g., encodings), you can get along with these (e.g. iconv on Unix-like systems, MultiByteToWideChar on Windows). Pro: No external library required.
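
For a sense of what the encoding-conversion part involves, here is a minimal portable sketch of the UTF-16 half: turning code points into UTF-16 code units. The function to_utf16 is a hypothetical helper written for this example; it is not how iconv or MultiByteToWideChar are implemented, and production code would normally call one of those facilities or a library instead.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: encode Unicode code points as UTF-16 code units.
std::u16string to_utf16(const std::u32string& code_points) {
    std::u16string out;
    for (char32_t cp : code_points) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            throw std::invalid_argument("surrogate code point");
        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");

        if (cp < 0x10000) {
            // Basic Multilingual Plane: one code unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Supplementary planes: 20 bits split across a surrogate pair.
            const char32_t v = cp - 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
        }
    }
    return out;
}
```

Characters outside the Basic Multilingual Plane take two code units, so UTF-16 is a variable-length encoding just as UTF-8 is.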

Philipp

I use std::string cross-platform all the time. – Jul 29 '10 at 08:48
But then you have to convert to UTF-16 every time you call an OS function. – Philipp Jul 29 '10 at 08:55
@Phillip Not on any OS I use (Windows, Linux, UNIX, Solaris) – Jul 29 '10 at 09:00
Please elaborate. All Windows API functions accept only UTF-16, so somewhere there must exist a UTF-8 ↔ UTF-16 conversion. – Philipp Jul 29 '10 at 11:34
@philip: last time I checked the ASCII version of all winapi functions didn't magically disappear. As long as there's no weird characters in the strings you're using on an OS level, no need for fancy Unicode. – rubenvb Jul 29 '10 at 11:40
I think it depends. File operations without Unicode support *are* no-go IMHO. But there are other APIs where not using Unicode may be OK. – Martin Ba Jul 30 '10 at 12:19
@Phillip I don't use Unicode for anything, and don't feel I am missing out. – Jul 30 '10 at 12:39
@Phillip: Only if you are doing internationalization. – Jul 30 '10 at 13:48
I find it interesting how many people keep on using the obsolete 8-bit API. I wish Microsoft had abolished this API a long time ago. – Philipp Jul 30 '10 at 14:18
ICU has some support for std::string now, and an increasing number of UTF-8 APIs, as well as efficient routines/macros for converting between UTFs. – Steven R. Loomis Aug 03 '10 at 20:11
@Philipp: I'm not really surprised. The problem with UTF-16 is that you have to change a zillion lines of old code that uses `char` strings. And then you *still* have a variable-length encoding. I wish Windows had taken the UTF-8 approach like the Unix-like OSes did. – dan04 Aug 04 '10 at 23:49
