
Currently, I have to deal with Unicode in C++11 (Linux environment). UTF-8 is used as the default encoding. The tasks I need:

  • Replace.
  • Regex
  • Iterate through a UTF-8 string. I don't know if using std::string and "for (char c : s)" will do what I want, because each iteration step must yield one Unicode character. For example, ế is one character, and mão is a word containing 3 characters.
  • Substring.
  • Concatenate substring with unicode characters or concatenate unicode characters.
  • Length.
  • Trim.
  • Read and write files.

What library should I use to achieve the best result?

Thank you very much. Looking forward to hearing from you soon.

Null Pointer
    C++ strings deal only in raw elements (`char` for `std::string`, `wchar_t` for `std::wstring`, `char16_t` for `std::u16string`, `char32_t` for `std::u32string`). They have no concept of character encodings like UTF-8/16/32 (though `char16_t` is intended for UTF-16, and `char32_t` for UTF-32), which may use more than 1 element per single Unicode codepoint, depending on its value. I will say that a `for (char c : s)` loop will NOT handle Unicode characters outside the BMP for `std::string` or `std::u16string`, but it will for `std::u32string`, and for `std::wstring` when `wchar_t` is 4 bytes. – Remy Lebeau Dec 07 '18 at 03:30
    For what you are asking, you really need a good Unicode-aware library. There are plenty of them available if you look around (don't ask here; asking for recommendations is off-topic). – Remy Lebeau Dec 07 '18 at 03:32
  • Thank you very much. I did my research before asking. The 3 libraries that I found are UTF8-CPP, ICU, and Boost. I don't know which is the best, and is there anything better out there? – Null Pointer Dec 07 '18 at 03:33
    "best" is subjective. Use whichever one(s) suit your needs. Use multiple ones, if you want. – Remy Lebeau Dec 07 '18 at 03:38
  • Oh yeah. My bad. I should say "suit". Thank you very much. – Null Pointer Dec 07 '18 at 03:39
  • To input and output UTF-8 with the STL, you can `std::imbue` the `wcin` and `wcout` wide-character streams, and any files you use, with converter facets from `<codecvt>`. If you don't need to work on individual characters, and are just passing strings through literally, you can just treat them as `char*`. – Davislor Dec 07 '18 at 04:30
  • Windows needs a bit of extra magic: you must `_setmode()` to `_O_U8TEXT` and set the console code page to 65001. – Davislor Dec 07 '18 at 04:32
  • While I answered, this is several questions asking to recommend a library, so StackOverflow is not necessarily where you’d want to ask this in the future. – Davislor Dec 07 '18 at 06:36
  • Are you writing a text-processing application? Your requirements are a vast overkill for any but the most demanding and full-featured text processors. And UTF-8 is not necessarily the most appropriate *internal* encoding for such applications (you of course need to read and write UTF-8 files, but that's the easiest part of this business). – n. m. could be an AI Dec 07 '18 at 08:46
  • @n.m. I'm writing a text pre-processor. Text from files must be pre-processed before being fed to my ML model. The encoding of input files is UTF-8. One of the tasks relates to Unicode characters (the string must be checked character by character). – Null Pointer Dec 07 '18 at 15:58
    "Unicode characters" are code points. What you describe is called user-perceived characters or [grapheme clusters](http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries). Out of the three libraries you mention, only ICU is capable of this; but I'm not quite sure *why* you would need to consider grapheme clusters. What kind of pre-processing are you doing? – n. m. could be an AI Dec 08 '18 at 05:52
  • In some languages, a grapheme cluster is considered a single character. Some normalization tasks (based on language-specific knowledge) require grapheme-cluster iteration, grapheme-cluster replacement, etc. – Null Pointer Dec 08 '18 at 07:38
  • You may want to ask a separate question that describes these tasks in detail. It could be as simple as detecting combining characters in [NFD](http://unicode.org/reports/tr15/#Norm_Forms), or more complicated. They probably go far beyond your initial scope of "using UTF-8". – n. m. could be an AI Dec 08 '18 at 08:20
  • Thank you. I've already done the simple tasks (including NFC). Only some tasks that must process grapheme clusters separately remain. I'm trying to use ICU to achieve my goal. I hope it'll be okay. – Null Pointer Dec 08 '18 at 12:27

1 Answer


For the regex/replace/search functions, I’ve previously used PCRE. This is designed to work with UTF-8 strings. You might be able to work with STL regular expressions, but not in any portable way. (Windows, in particular, does not support UTF-8 locales.)

Iterating through a UTF-8 string is even more complicated than you describe, if you need to support combining marks or the zero-width joiner! You write that é is one character, but it might be two Unicode codepoints: Latin small letter e + combining acute accent above. If you simply want to iterate through codepoints, you might use mbtowc() or std::codecvt::do_in from the Standard Library. If you need to iterate through graphemes, the most portable way to do that is with ICU.

Regular string concatenation should work, and the standard library has mblen() for length. This isn’t completely portable, because the multibyte encoding does not have to be UTF-8 (although there is a standard set of conversion functions).

Davislor