13

I know all about std::string and std::wstring, but they don't seem to take the multi-unit character encodings of UTF-8 and UTF-16 fully into account (on Windows at least). There is also no support for UTF-32.

So does anyone know of cross-platform drop-in replacement classes that provide full UTF-8, UTF-16 and UTF-32 support?

Goz

7 Answers

11

And let's not forget the lightweight, very user-friendly, header-only UTF-8 library UTF8-CPP. Not a drop-in replacement, but can easily be used in conjunction with std::string and has no external dependencies.

Jon Purdy
9

Well, in C++0x there are the classes std::u32string and std::u16string. GCC already partially supports them, so you can use them today, but stream support for Unicode is not done yet (see "Unicode support in C++0x").

UmmaGumma
  • Hmm I hadn't noticed that in the new standard. Very interesting. A big shame that I can't use it on compilers lacking C++0x support (such as the iPhone compiler). It genuinely shocks me that these classes don't already exist ... – Goz Feb 01 '11 at 12:10
  • Interestingly, though, it seems that GCC > 4.4 and VS2010 both support it. Which is brilliant. On the major platforms that covers Windows, Linux and the Android mobile platform. Clang also states that "many" examples work ... – Goz Feb 01 '11 at 12:15
  • 2
    @Goz Well, not everything is as good as you think. VS2010 supports Unicode strings, but it doesn't support Unicode string literals. u"Hello" is a UTF-16 string literal and U"Hello" is a UTF-32 literal. Visual Studio doesn't recognize them. And, as I already said, GCC doesn't support input and output streams for them yet. – UmmaGumma Feb 01 '11 at 12:21
7

It's not STL, but if you want proper Unicode in C++, then you should take a look at ICU.

Cat Plus Plus
  • Looks interesting. Shame there is no STL string support from it though ... it would be perfect in that case ... – Goz Feb 01 '11 at 11:37
  • read about it, but after spending some time with DB2, I'd think twice before touching anything coming from IBM. Have you worked with it? Is it good? – davka Feb 01 '11 at 13:21
  • 1
    @Goz: I could not agree more, Unicode is "standard" enough that we could wish for a string that does more than store byte sequences... – Matthieu M. Feb 01 '11 at 13:21
3

There is no support for UTF-8 in the STL. As an alternative you can use Boost's utf8_codecvt_facet:

//...
// My encoding type
typedef wchar_t ucs4_t;

std::locale old_locale;
std::locale utf8_locale(old_locale,new utf8_codecvt_facet<ucs4_t>);

// Set a New global locale
std::locale::global(utf8_locale);

// Send the UCS-4 data out, converting to UTF-8
{
    std::wstringstream oss;
    oss.imbue(utf8_locale);
    std::copy(ucs4_data.begin(),ucs4_data.end(),
        std::ostream_iterator<ucs4_t,ucs4_t>(oss));

    std::wcout << oss.str() << std::endl;
}
vz0
  • It's not really a drop-in replacement though ;) Ideally I'd love to see something like std::string8, std::string16 and std::string32 ... – Goz Feb 01 '11 at 11:43
2

For UTF-8 support, there is the Glib::ustring class. It is modeled after std::string but is UTF-8 aware, e.g. when you are scanning the string with an iterator. It also has some restrictions, e.g. the iterator is always const, because replacing a character can change the length of the string and so invalidate other iterators.

ustring does not automatically convert other encodings to UTF-8; the Glib library has various conversion functions for this. You can, however, check whether a string is valid UTF-8.

Also, ustring and std::string are interchangeable: ustring has a cast operator to std::string, so you can pass a ustring as a parameter where a std::string is expected, and vice versa, since a ustring can be constructed from a std::string.

davka
2

Qt has QString, which uses UTF-16 internally but has methods for converting to or from std::wstring, UTF-8, Latin1 or the locale encoding. There is also the QTextCodec class, which can convert QStrings to or from basically anything. But using Qt just for strings seems like overkill to me.

Sergei Tachenov
  • Yeah, alas, you are totally right on using it purely for strings. I like qt though and do use it for quite a bit :) – Goz Feb 01 '11 at 14:24
1

Also look at http://grigory.info/UTF8Strings.About.html; it is UTF-8 native.