
Working with std::string and UTF-8 seems like a rather complicated issue, and I cannot find a good explanation of the do's and don'ts.

How can I properly work with UTF-8 in C++? It is rather confusing.

I've found boost::locale and I set the global locale:

std::locale::global(boost::locale::generator()(""));

However, after this what do I need to think about, when can I get problems? Will writing/reading from file work as expected, string comparisons etc...?

So far I'm aware of the following:

  • std::regex/boost::regex will not work; I need to convert to wide strings and use wregex (see the sketch below the list).
  • boost::algorithm::to_upper will not work; I need to use boost::locale::to_upper instead.
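
To make those two points concrete, here is a minimal sketch (not guaranteed to be complete; it assumes the source file is UTF-8-encoded and that the global locale has been set up as above):

#include <boost/locale.hpp>
#include <iostream>
#include <regex>
#include <string>

int main() {
    // Install a UTF-8 aware global locale, as in the question.
    std::locale::global(boost::locale::generator()(""));

    std::string s = "grüßen"; // UTF-8 bytes in a plain std::string

    // Case conversion: boost::locale::to_upper is locale/Unicode aware,
    // unlike boost::algorithm::to_upper, which works byte by byte.
    std::cout << boost::locale::to_upper(s) << "\n"; // "GRÜSSEN" with the ICU backend

    // Regex: convert the UTF-8 string to a wide string and use std::wregex.
    std::wstring ws = boost::locale::conv::utf_to_utf<wchar_t>(s);
    std::wregex re(L"grü.*");
    std::cout << std::boolalpha << std::regex_search(ws, re) << "\n"; // true
}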

Other than that what do I need to be aware of?

ronag
  • right. internally in the application, don't work with UTF-8. C++ standard library is built on the assumption of one encoding value = one character. – Cheers and hth. - Alf Jun 10 '12 at 10:36
  • Then what am I supposed to work with? wstring/UTF16 isn't one encoding value = one character either? – ronag Jun 10 '12 at 10:39
  • @Cheersandhth.-Alf: that assumption is untrue in UCS-4, too. In fact, it's untrue in *any* Unicode encoding. –  Jun 10 '12 at 10:40
  • @Fanael: it's not so much of a practical concern. original ASCII had the same issue with the tilde ~, and suggested underlining via backspaces and underlines. that's also not much of a practical concern. not being able to process INDIVIDUAL BASIC CHARACTERS is a concern. a huge one. in other words, unless your comment is purely argumentative, it appears that you missed the point. in that case, – Cheers and hth. - Alf Jun 10 '12 at 17:06
  • 1
    @Cheersandhth.-Alf: not a practical concern? Do you know how Unicode handles Devanagari? Or even extended Latin more complicated than what's already there precomposed? If you want code points, fine. If you want *individual characters*, it's very much of practical concern. –  Jun 11 '12 at 13:08
  • @Fanael: consider that Windows is the leading PC platform, and that mostly all Windows software is based on using 16-bit `wchar_t` as single characters. in that light, of practically working software, which includes the most used software in the world today, your comment about Devanagari support being a practical concern, whatever the heck Devanagari is, is pretty silly. of course you are free to implement support for the whole galaxy + magellanic clouds before you consider your software useful, but ***do not advise others to do that***, please. – Cheers and hth. - Alf Jun 11 '12 at 14:30
  • 3
    @Cheersandhth.-Alf: yet Windows itself somehow manages to support Hindi, which is a pretty significant language using Devanagari as its native script. So, you're free to implement support for these few languages you happen to know (which all are presumably using a quite restricted subset of extended Latin), but ***do not advice others to do that***, please. Also, [ICU](http://site.icu-project.org/). It makes writing code that handles all these weird scripts bearable. –  Jun 11 '12 at 14:38

1 Answer


Welcome to the magnificent world of Unicode.

  1. Sorry, wchar_t is implementation-defined, and typically on Windows will not be sufficient to hold a full code-point for Asiatic scripts (for example).
  2. You can use comparisons for look-up, but to sort data and present it to an audience you will need a full collation algorithm (a sketch follows this list). Know, for example, that the order in the German dictionary is different from that in the German phone book (and cry...).
  3. Generally speaking, I would advise against transforming strings yourself. Boost.Locale algorithms should work in general since they wrap ICU, but otherwise refrain from ad-hoc operations.
  4. If you split the string into several parts, don't split in the middle of words (again, see the sketch after this list). It's too easy to either split a character in two (even with code-point-aware algorithms, because of diacritics), or, even avoiding that, to split between two characters (because some cultures consider certain combinations of adjacent characters as one).
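
Not from the original answer, but a minimal sketch of points 2 and 4 using Boost.Locale (the locale name "de_DE.UTF-8" and its availability are assumptions; they depend on the platform and backend):

#include <boost/locale.hpp>
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

int main() {
    boost::locale::generator gen;
    std::locale loc = gen("de_DE.UTF-8"); // assumed to be available here
    std::locale::global(loc);

    // Point 2: collation. A std::locale is itself a comparator that uses its
    // collate facet, so sorting follows the locale's rules rather than raw bytes.
    std::vector<std::string> words = {"Zebra", "Äpfel", "Apfel"};
    std::sort(words.begin(), words.end(), loc);
    for (const auto& w : words) std::cout << w << " ";
    std::cout << "\n";

    // Point 4: boundary analysis. Split on word boundaries instead of bytes.
    namespace ba = boost::locale::boundary;
    std::string text = "Grüße aus Zürich";
    ba::ssegment_index index(ba::word, text.begin(), text.end(), loc);
    index.rule(ba::word_any); // keep only segments that actually contain words
    for (const auto& segment : index) std::cout << "[" << segment << "] ";
    std::cout << "\n";
}
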
Matthieu M.
  • "typically on Windows will not be sufficient to hold a full code-point for Asiatic scripts" is wrong, CJK scripts are in the BMP. –  Jun 10 '12 at 10:44
  • 3
    @Fanael: most are, but some extensions are in the [Supplementary Ideographic Plane](http://en.wikipedia.org/wiki/Supplementary_Ideographic_Plane#Supplementary_Ideographic_Plane) – Matthieu M. Jun 10 '12 at 11:08
  • Actually, the current Unicode support under MS-Windows uses UTF-16, which encodes the full 20 bits needed for supplementary characters with surrogate pairs (code units between D800 and DFFF), including the Supplementary Ideographic Plane. Older versions of MS-Windows (if I'm correct, Win2k and older) used UCS-2. – Alexis Wilke Sep 07 '13 at 03:15
  • In regard to splitting, Unicode defines precisely how you can do that. You have to follow the rules, that's all. Of course, that's a bit of work... more info on http://www.unicode.org/ – Alexis Wilke Sep 07 '13 at 03:17
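
Not part of the comment thread, but a small sketch of the surrogate-pair arithmetic mentioned above; it encodes the first code point of the Supplementary Ideographic Plane into the two 16-bit code units that a UTF-16 wchar_t string would hold on Windows:

#include <cstdint>
#include <cstdio>

int main() {
    std::uint32_t cp = 0x20000; // U+20000, first code point of the Supplementary Ideographic Plane

    // UTF-16 encodes code points above U+FFFF as a surrogate pair:
    std::uint32_t v = cp - 0x10000;            // 20 bits remain
    std::uint16_t high = 0xD800 + (v >> 10);   // leading (high) surrogate
    std::uint16_t low  = 0xDC00 + (v & 0x3FF); // trailing (low) surrogate

    std::printf("U+%04X -> %04X %04X\n", (unsigned)cp, (unsigned)high, (unsigned)low);
    // prints: U+20000 -> D840 DC00
}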