
Working with std::string and UTF-8 seems like a rather complicated issue, and I cannot find a good explanation of the do's and don'ts.

How can I properly work with UTF-8 in C++? It is rather confusing.

I've found boost::locale and I set the global locale:

std::locale::global(boost::locale::generator()(""));

However, after this what do I need to think about, when can I get problems? Will writing/reading from file work as expected, string comparisons etc...?

So far I'm aware of the following:

  • std::regex/boost::regex will not work; I need to convert to wide strings and use wregex (see the sketch below the list).
  • boost::algorithm::to_upper will not work; I need to use boost::locale::to_upper instead.
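
To make those two points concrete, here is a minimal sketch (not guaranteed to be complete; it assumes the source file is UTF-8-encoded and that the global locale has been set up as above):

#include <boost/locale.hpp>
#include <iostream>
#include <regex>
#include <string>

int main() {
    // Install a UTF-8 aware global locale, as in the question.
    std::locale::global(boost::locale::generator()(""));

    std::string s = "grüßen"; // UTF-8 bytes in a plain std::string

    // Case conversion: boost::locale::to_upper is locale/Unicode aware,
    // unlike boost::algorithm::to_upper, which works byte by byte.
    std::cout << boost::locale::to_upper(s) << "\n"; // "GRÜSSEN" with the ICU backend

    // Regex: convert the UTF-8 string to a wide string and use std::wregex.
    std::wstring ws = boost::locale::conv::utf_to_utf<wchar_t>(s);
    std::wregex re(L"grü.*");
    std::cout << std::boolalpha << std::regex_search(ws, re) << "\n"; // true
}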

Other than that what do I need to be aware of?

ronag
  • right. internally in the application, don't work with UTF-8. C++ standard library is built on the assumption of one encoding value = one character. – Cheers and hth. - Alf Jun 10 '12 at 10:36
  • Then what am I supposed to work with? wstring/UTF16 isn't one encoding value = one character either? – ronag Jun 10 '12 at 10:39
  • @Cheersandhth.-Alf: that assumption is untrue in UCS-4, too. In fact, it's untrue in *any* Unicode encoding. –  Jun 10 '12 at 10:40
  • @Fanael: it's not so much of a practical concern. original ASCII had the same issue with the tilde ~, and suggested underlining via backspaces and underlines. that's also not much of a practical concern. not being able to process INDIVIDUAL BASIC CHARACTERS is a concern. a huge one. in other words, unless your comment is purely argumentative, it appears that you missed the point. in that case, – Cheers and hth. - Alf Jun 10 '12 at 17:06
  • 1
    @Cheersandhth.-Alf: not a practical concern? Do you know how Unicode handles Devanagari? Or even extended Latin more complicated than what's already there precomposed? If you want code points, fine. If you want *individual characters*, it's very much of practical concern. –  Jun 11 '12 at 13:08
  • @Fanael: consider that Windows is the leading PC platform, and that mostly all Windows software is based on using 16-bit `wchar_t` as single characters. in that light, of practically working software, which includes the most used software in the world today, your comment about Devanagari support being a practical concern, whatever the heck Devanagari is, is pretty silly. of course you are free to implement support for the whole galaxy + magellanic clouds before you consider your software useful, but ***do not advise others to do that***, please. – Cheers and hth. - Alf Jun 11 '12 at 14:30
  • 3
    @Cheersandhth.-Alf: yet Windows itself somehow manages to support Hindi, which is a pretty significant language using Devanagari as its native script. So, you're free to implement support for these few languages you happen to know (which all are presumably using a quite restricted subset of extended Latin), but ***do not advice others to do that***, please. Also, [ICU](http://site.icu-project.org/). It makes writing code that handles all these weird scripts bearable. –  Jun 11 '12 at 14:38

1 Answer


Welcome to the magnificent world of Unicode.

  1. Sorry, wchar_t is implementation-defined, and typically on Windows will not be sufficient to hold a full code-point for Asiatic scripts (for example).
  2. You can use comparisons for look-up, but to sort data and present it to an audience you will need a full collation algorithm (a sketch follows this list). Know, for example, that the order in the German dictionary is different from that in the German phone book (and cry...).
  3. Generally speaking, I would advise against transforming strings yourself. Boost.Locale algorithms should work in general since they wrap ICU, but otherwise refrain from ad-hoc operations.
  4. If you split the string into several parts, don't split in the middle of words (again, see the sketch after this list). It's too easy to either split a character in two (even with code-point-aware algorithms, because of diacritics), or, even avoiding that, to split between two characters (because some cultures consider certain combinations of adjacent characters as one).
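
Not from the original answer, but a minimal sketch of points 2 and 4 using Boost.Locale (the locale name "de_DE.UTF-8" and its availability are assumptions; they depend on the platform and backend):

#include <boost/locale.hpp>
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

int main() {
    boost::locale::generator gen;
    std::locale loc = gen("de_DE.UTF-8"); // assumed to be available here
    std::locale::global(loc);

    // Point 2: collation. A std::locale is itself a comparator that uses its
    // collate facet, so sorting follows the locale's rules rather than raw bytes.
    std::vector<std::string> words = {"Zebra", "Äpfel", "Apfel"};
    std::sort(words.begin(), words.end(), loc);
    for (const auto& w : words) std::cout << w << " ";
    std::cout << "\n";

    // Point 4: boundary analysis. Split on word boundaries instead of bytes.
    namespace ba = boost::locale::boundary;
    std::string text = "Grüße aus Zürich";
    ba::ssegment_index index(ba::word, text.begin(), text.end(), loc);
    index.rule(ba::word_any); // keep only segments that actually contain words
    for (const auto& segment : index) std::cout << "[" << segment << "] ";
    std::cout << "\n";
}
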
Matthieu M.
  • "typically on Windows will not be sufficient to hold a full code-point for Asiatic scripts" is wrong, CJK scripts are in the BMP. –  Jun 10 '12 at 10:44
  • 3
    @Fanael: most are, but some extensions are in the [Supplementary Ideographic Plane](http://en.wikipedia.org/wiki/Supplementary_Ideographic_Plane#Supplementary_Ideographic_Plane) – Matthieu M. Jun 10 '12 at 11:08
  • Actually, the current Unicode support under MS-Windows uses UTF-16, which encodes the full 20 bits needed for supplementary characters with surrogate pairs (code units between D800 and DFFF), including the Supplementary Ideographic Plane. Older versions of MS-Windows (if I'm correct, Win2k and older) used UCS-2. – Alexis Wilke Sep 07 '13 at 03:15
  • In regard to splitting, Unicode defines precisely how you can do that. You have to follow the rules, that's all. Of course, that's a bit of work... more info on http://www.unicode.org/ – Alexis Wilke Sep 07 '13 at 03:17
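
Not part of the comment thread, but a small sketch of the surrogate-pair arithmetic mentioned above; it encodes the first code point of the Supplementary Ideographic Plane into the two 16-bit code units that a UTF-16 wchar_t string would hold on Windows:

#include <cstdint>
#include <cstdio>

int main() {
    std::uint32_t cp = 0x20000; // U+20000, first code point of the Supplementary Ideographic Plane

    // UTF-16 encodes code points above U+FFFF as a surrogate pair:
    std::uint32_t v = cp - 0x10000;            // 20 bits remain
    std::uint16_t high = 0xD800 + (v >> 10);   // leading (high) surrogate
    std::uint16_t low  = 0xDC00 + (v & 0x3FF); // trailing (low) surrogate

    std::printf("U+%04X -> %04X %04X\n", (unsigned)cp, (unsigned)high, (unsigned)low);
    // prints: U+20000 -> D840 DC00
}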