In UTF-8, the code point (character) ä consists of two code units, each one byte wide. C++ has no support for treating a string as a sequence of code points. Therefore, as far as the standard library is concerned, std::string("ä").size() is 2.
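To make the code unit versus code point distinction concrete, here is a minimal sketch that counts code points in a UTF-8 std::string by skipping continuation bytes. The function name count_code_points is made up for illustration, and the sketch assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 encoded string.
// Continuation bytes have the bit pattern 10xxxxxx, so every byte that
// does NOT match that pattern starts a new code point.
// Assumes valid UTF-8; malformed input is not detected.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;  // ASCII byte or lead byte of a multi-byte sequence
        }
    }
    return n;
}
```

With ä written as the explicit bytes "\xC3\xA4", size() reports 2 while count_code_points reports 1.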
A simple approach is to use std::wstring. wstring uses a character type (wchar_t) which is at least as wide as the widest character set supported by the system. Therefore, if the system supports an encoding wide enough to represent any (non-composite) Unicode character with a single code unit, the string methods behave as you would expect. Currently UTF-32 is wide enough, and it is what wchar_t holds on (most?) Unix-like OSes.
A thing to note is that on Windows wchar_t is 16 bits wide and holds UTF-16, not UTF-32. So if you choose the wstring approach, port your program to Windows, and a user enters a Unicode character outside the Basic Multilingual Plane (above U+FFFF), that character takes two code units (a surrogate pair) and the presumption of one code unit per code point no longer holds.
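The UTF-16 pitfall can be reproduced on any platform with std::u16string, which always uses 16-bit code units. The following is only a sketch: count_code_points_utf16 is a hypothetical helper, and it assumes well-formed UTF-16 (no unpaired surrogates).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-16 string.
// A code point above U+FFFF is stored as a high surrogate followed by a
// low surrogate (0xDC00..0xDFFF), so skipping low surrogates counts each
// pair once. Assumes well-formed UTF-16.
std::size_t count_code_points_utf16(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t u : s) {
        if (u < 0xDC00 || u > 0xDFFF) {
            ++n;  // BMP code unit or high (leading) surrogate
        }
    }
    return n;
}
```

For U+1F600, std::u16string(u"\U0001F600").size() is 2 while std::u32string(U"\U0001F600").size() is 1, which is exactly the difference between a 16-bit and a 32-bit wchar_t.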
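The wstring approach also doesn't take control or composite (combining) characters into consideration. For example, ä written as the letter a followed by the combining diaeresis U+0308 is two code points, so size() is 2 even with UTF-32.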
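A small sketch of the combining-character problem, using std::u32string so the result does not depend on the width of wchar_t. Both helper names are invented for this example, and the check covers only the Combining Diacritical Marks block (U+0300 to U+036F). Real text segmentation needs the full Unicode rules, which this does not attempt.

```cpp
#include <cstddef>
#include <string>

// Hypothetical, deliberately incomplete check: true only for code points
// in the Combining Diacritical Marks block (U+0300..U+036F).
bool is_combining_mark(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Approximate count of user-visible characters: every code point that is
// not a combining mark from the block above starts a new character.
std::size_t count_visible_approx(const std::u32string& s) {
    std::size_t n = 0;
    for (char32_t c : s) {
        if (!is_combining_mark(c)) {
            ++n;
        }
    }
    return n;
}
```

The precomposed U"\u00E4" has size 1 and the decomposed U"a\u0308" has size 2, yet count_visible_approx returns 1 for both.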
Here's a little test program which takes a std::string containing the multi-byte UTF-8 character ä and converts it to a wstring:
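// needs <codecvt>, <locale>, <string>, <iostream> and using namespace std
string foo("ä"); // read however you want; this literal assumes a UTF-8 source file
wstring_convert<codecvt_utf8<wchar_t>> converter;
wstring wfoo = converter.from_bytes(foo); // pass the string itself, not foo.data()
cout << foo.size() << endl; // 2 on my system
cout << wfoo.size() << endl; // 1 on my system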
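Unfortunately, as of gcc-4.8 at least, libstdc++ had not implemented <codecvt>, which was introduced in C++11; the header arrived in GCC 5. Note also that wstring_convert and <codecvt> were deprecated in C++17. If you can't require libc++, then similar functionality is probably in Boost.Locale.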
Alternatively, if you wish to keep your code portable to systems that don't support UTF-32, you can keep using std::string and use an external library for iterating, counting and so on. Here's one: http://utfcpp.sourceforge.net/ and another: http://site.icu-project.org/. I believe this is the recommended approach.
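To give an idea of what such a library does while iterating, here is a hand-rolled sketch that decodes a UTF-8 std::string into code points. It is not the API of utfcpp or ICU; decode_utf8 is a hypothetical name, and the sketch assumes valid UTF-8 and performs no validation, which the real libraries do.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 into a sequence of code points.
// Assumes valid input: overlong forms, stray continuation bytes and
// truncated sequences are not reported as errors.
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 1;   // number of bytes in this sequence
        char32_t cp = lead;    // ASCII case: the byte is the code point
        if (lead >= 0xF0)      { len = 4; cp = lead & 0x07; }
        else if (lead >= 0xE0) { len = 3; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { len = 2; cp = lead & 0x1F; }
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Decoding "a\xC3\xA4" yields two code points, U+0061 and U+00E4, so the length of the result is the character count that std::string::size() does not give you.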