2

I want to get each character from a Unicode string. If this is a bad question, I hope you'll understand.

string str = "öp";
for (int i = 0; i < str.length(); i++) {
 cout << str[i] << endl;
}

In this case, str[0] is a broken character, because "ö" occupies two bytes. How can I handle this? I really appreciate your answers. Thank you.

  • 3
    @AlexF No, absolutely do not do that. `wchar_t` is fundamentally broken. – Konrad Rudolph Feb 10 '20 at 10:52
  • The Qt `QString` class is very useful for this. – Jesper Juhl Feb 10 '20 at 10:56
  • If you're on Linux, I think string(`char` to be exact) will anyways be unicode-compliant. – theWiseBro Feb 10 '20 at 11:16
  • @theWiseBro Most modern Linux distributions do indeed use unicode and the UTF-8 encoding by *default*. That doesn't mean they *always* do. You can change that and programs should be able to cope. – Jesper Juhl Feb 10 '20 at 11:23
  • 5
    @theWiseBro It isn’t, and it can’t. `char` is merely a byte, and `char` strings serve as byte buffer storage. They are encoding-agnostic. This means that they are a suitable storage medium that can represent all possible Unicode code points, but they do not allow encoding-specific access to the data. In particular, accessing individual `char`s does not necessarily resolve individual Unicode code points or glyphs, which is what OP wants. You’ll need to use a Unicode aware text library. – Konrad Rudolph Feb 10 '20 at 11:25
  • [please check this answer it may help .](https://stackoverflow.com/questions/246806/i-want-to-convert-stdstring-into-a-const-wchar-t) – akshay chaudhari Feb 10 '20 at 11:42
  • What exactly do you want to do? The code example you show just prints a string to the console. You don't need to do it a byte at a time then, provided your terminal supports Unicode. – gavinb Feb 10 '20 at 11:48
  • You can try and use a string class that supports utf-8. There are libraries have such. Probably lots of em. I believe fmt has support for it. – ALX23z Feb 10 '20 at 12:05
  • Issue is also the console... – Jarod42 Feb 10 '20 at 12:24
  • @KonradRudolph To add to that, if it is of any help to anyone, search keywords are "unicode segmentation" and "unicode normalization", as you often expect unicode to be in a single "normal" form, and iterating over what we understand as "characters" on screen is nice. People often suggest [ICU](http://site.icu-project.org/download), but there's also Courier Mail Server that separately provides standalone [Courier Unicode Library](http://www.courier-mta.org/download.html) that is smaller than ICU. –  Feb 10 '20 at 15:56

3 Answers

2

In order to insert characters (for example, the newlines you attempt to insert in your example) into a UTF-8 string, you must only do so between complete grapheme clusters. Right now you add a newline after an incomplete code point, which breaks the encoding.


The Unicode standard defines the encoding. See this section in particular:

3.9 Unicode Encoding Forms

UTF-8

Table 3-6. UTF-8 Bit Distribution

+----------------------------+------------+-------------+------------+-------------+
|        Scalar Value        | First Byte | Second Byte | Third Byte | Fourth Byte |
+----------------------------+------------+-------------+------------+-------------+
| 00000000 0xxxxxxx          | 0xxxxxxx   |             |            |             |
| 00000yyy yyxxxxxx          | 110yyyyy   | 10xxxxxx    |            |             |
| zzzzyyyy yyxxxxxx          | 1110zzzz   | 10yyyyyy    | 10xxxxxx   |             |
| 000uuuuu zzzzyyyy yyxxxxxx | 11110uuu   | 10uuzzzz    | 10yyyyyy   | 10xxxxxx    |
+----------------------------+------------+-------------+------------+-------------+

From these, we can devise the following algorithm to iterate code points:

for (std::size_t i = 0; i < str.length();) {
    std::cout << str[i];

    if (str[i] & 0x80) {          // lead byte of a multi-byte sequence
        std::cout << str[i + 1];
        if (str[i] & 0x20) {      // at least three bytes
            std::cout << str[i + 2];
            if (str[i] & 0x10) {  // four bytes
                std::cout << str[i + 3];
                i += 4;
            } else {
                i += 3;
            }
        } else {
            i += 2;
        }
    } else {                      // single-byte (ASCII) code point
        i += 1;
    }

    std::cout << std::endl;
}

This trivial algorithm is sufficient for your example as long as the string is normalised into a composed form, i.e. "ö" is a single code point. For general usage, however, a more complex algorithm is needed to distinguish grapheme clusters.

Furthermore, this trivial algorithm doesn't check for invalid sequences and may overflow the input string in such a case. It is only a simple example, not intended for production use. For production use, I would recommend an external library.

eerorika
  • 1
this breaks up code points, but doesn't take care of grapheme clusters. If someone were to make an "ö" out of "o" + "combining diaeresis", this would still separate the two and still produce a "broken character" – PeterT Feb 10 '20 at 12:50
  • @PeterT Oh, damn it. It's like it would be better to use someone else's implementation :) I'll add an explanation to the answer, but I won't implement that. – eerorika Feb 10 '20 at 12:53
  • @eerorika re: `C++ standard library has no functionality to help iterate code points of unicode or any other variable width encoding` https://en.cppreference.com/w/cpp/locale/codecvt/length allows you to iterate through codepoints, as you can find out how many bytes encode next codepoint. That's not making it easier to iterate over grapheme clusters, though. I've realized it midway while trying to write an example, so I'm not going to make it an answer on its own, but the beginning of it can be found [here](https://pastebin.com/aHsDRm3f). –  Feb 10 '20 at 13:02
  • @dovvei I'll take the claim off the answer. That said, `std::locale("en_US.utf8")` used in the example is not very portable as far as I understand, because locale names are system specific. – eerorika Feb 10 '20 at 13:05
  • Normalised is tangential, what you want is fully composed into single codepoints, which [NFC and NFKC](https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization) often are. It isn't always even possible though. – Deduplicator Feb 10 '20 at 13:48
  • @Deduplicator Added missing qualifier. – eerorika Feb 10 '20 at 13:54
  • Thank you @eerorika. Your solution was great help for me. –  Feb 10 '20 at 16:12
1

The problem is that UTF-8 (not Unicode itself) is a multi-byte character encoding. The most common characters (the ASCII character set) use only a single byte, but less common ones (notably emoji) can use up to 4. And that is far from being the only problem.

If you only use characters from the Basic Multilingual Plane, and can be sure never to encounter combining ones, you can safely use std::wstring and wchar_t, because on common platforms wchar_t is large enough to hold any character from the BMP.

But in the generic case, Unicode is a mess. Even when using char32_t, which can contain any Unicode code point, you cannot be sure to have a bijection between Unicode code points and graphemes (displayed characters). For example, LATIN SMALL LETTER E WITH ACUTE (é) is the Unicode character U+00E9. But it can also be represented in a decomposed form as U+0065 U+0301, i.e. LATIN SMALL LETTER E followed by a COMBINING ACUTE ACCENT. So even when using char32_t, you get 2 code points for one single grapheme, and it would be incorrect to split them:

char32_t eaccute[] = { 'e', 0x301, 0 };

This is indeed a representation of é. You can copy and paste it to verify that it is not the U+00E9 character but the decomposed one, yet in printed form there cannot be any difference.

TL/DR: Unless you are sure you only use a subset of the Unicode character set that could be represented in a much smaller character set such as ISO-8859-1 (Latin-1), you have no simple way to split a string into true characters.

Serge Ballesta
  • 1
    To be fair, language is a mess, and Unicode tries valiantly to make the best of a bad deal. – Deduplicator Feb 10 '20 at 13:50
  • @Deduplicator: I know, and Unicode grew from the Windows version to the current one with many additions. But because of that there is no simple *page* decomposition that would allow zero-width code points to be identified easily, which would be enough for the OP's question. – Serge Ballesta Feb 10 '20 at 13:58
0

The "atomic" unit of a string object is evidently either another string (containing a single code point) or a char32_t (a Unicode code point). The string is the most usable of the two, since it can be concatenated again and no UTF conversion is needed.

I am a bit rusty in C/C++, but something like:

string utf8_codepoint(const string& s, size_t i) {

    // Skip continuation bytes (10xxxxxx):
    while ((s[i] & 0xC0) == 0x80) {
        ++i;
    }

    string cp(1, s[i]);
    if ((s[i] & 0xC0) == 0xC0) { // Start byte of a multi-byte sequence.
        ++i;
        while ((s[i] & 0xC0) == 0x80) { // Continuation bytes.
            cp += s[i];
            ++i;
        }
    }
    return cp;
}

for (size_t i = 0; i < str.length(); ) {
    string cp = utf8_codepoint(str, i);
    i += cp.length();
    cout << cp << endl;
}

Of course there are zero-width accents in Unicode that cannot be printed in isolation, but the same holds for control characters, or for not having a font with full Unicode support (which would be a font some 35 MB in size).

Joop Eggen