C++ iterate utf-8 string with mixed length of characters

Question

I need to loop over a UTF-8 string and get each character of the string. The string may mix characters of different byte lengths, e.g. digits that take one byte, Chinese characters that take three bytes, etc.

I looked at this post and it does 80% of the job. However, when the string has 3-byte Chinese characters before 1-byte digits, the digits are also treated as 3 bytes long and are printed as 1**, where * is gibberish.

To give an example, if the string is '今天周五123', the result will be:

今
天
周
五
1**
2**
3**

where * is gibberish. However, if the string is '123今天周五', the numbers print out fine.

The minimally adapted code from the above-mentioned post is copied here:

#include <cstdint>   // uint32_t
#include <cstring>   // strlen
#include <iostream>
#include <string>
#include "utf8.h"

using namespace std;

int main() {    
    string text = "今天周五123";

    char* str = (char*)text.c_str();    // utf-8 string
    char* str_i = str;                  // string iterator
    char* end = str+strlen(str)+1;      // end iterator

    unsigned char symbol[5] = {0,0,0,0,0};

    cout << symbol << endl;

    do
    {
        uint32_t code = utf8::next(str_i, end); // get 32 bit code of a utf-8 symbol
        if (code == 0)
            continue;

        cout << "utf 32 code:" << code << endl;

        utf8::append(code, symbol); // initialize array `symbol`

        cout << symbol << endl;

    }
    while ( str_i < end );

    return 0;
}

Can anyone help me here? I am new to C++, and although I checked the documentation of UTF8-CPP, I still have no idea where the problem is. I think the library was created to handle UTF-8 text whose characters have different lengths, so there should be a way to do this. I have been struggling with this for two days.

@deviantfan yes. In fact any help would be appreciated. I read that the ICU library should also be able to handle this, but I'm too new to c++ to figure out how that works... — Hai, Oct 15 '16 at 04:05
Ok, I'll extend my answer in some minutes ... about ICU, it's way too heavyweight for something this simple. — deviantfan, Oct 15 '16 at 04:09
A useful resource when working with unicode is https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=iws-appendixa — Michael Surette, Nov 28 '18 at 16:50

score 15 · Accepted Answer · edited Jul 08 '20 at 08:58

Insert

memset(symbol, 0, sizeof(symbol));

before

utf8::append(code, symbol);
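
Why this helps, as far as I can tell: utf8::append writes only the bytes of the encoded character into symbol and adds no terminator. After a 3-byte character, a following 1-byte character overwrites just the first byte, and cout prints the two stale bytes behind it. The sketch below reproduces the effect without the library. Note that append_utf8 and encode_into are hypothetical stand-ins written for this illustration; they are not part of utf8.h.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical stand-in for utf8::append: writes the UTF-8 bytes of `cp`
// to `out` and returns a pointer just past the last byte written.
// It writes no terminating '\0'.
// Assumes `cp` is a valid code point and `out` has room for 4 bytes.
unsigned char* append_utf8(std::uint32_t cp, unsigned char* out) {
    if (cp < 0x80) {
        *out++ = static_cast<unsigned char>(cp);
    } else if (cp < 0x800) {
        *out++ = static_cast<unsigned char>(0xC0 | (cp >> 6));
        *out++ = static_cast<unsigned char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        *out++ = static_cast<unsigned char>(0xE0 | (cp >> 12));
        *out++ = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<unsigned char>(0x80 | (cp & 0x3F));
    } else {
        *out++ = static_cast<unsigned char>(0xF0 | (cp >> 18));
        *out++ = static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F));
        *out++ = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<unsigned char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encodes `cp` into the reused buffer `symbol` and returns the buffer
// contents read as a C string. With clear_first == false, bytes left over
// from a longer previous character stay in the buffer and end up in the result.
std::string encode_into(std::uint32_t cp, unsigned char (&symbol)[5], bool clear_first) {
    if (clear_first) std::memset(symbol, 0, sizeof(symbol));
    append_utf8(cp, symbol);
    return std::string(reinterpret_cast<const char*>(symbol));
}
```

Encoding 五 (U+4E94) and then '1' into the same buffer with clear_first set to false gives the byte '1' followed by the last two bytes of 五, which is the 1** from the question. With clear_first set to true the result is just '1'. If I remember the library correctly, utf8::append also returns an iterator just past the bytes it wrote, so writing a 0 through that return value should work as an alternative to the memset.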

If this for some reason still doesn't work, or if you want to get rid of the lib, recognizing codepoints is not that complicated:

string text = "今天周五123";
for(size_t i = 0; i < text.length();)
{
    int cplen = 1;
    if((text[i] & 0xf8) == 0xf0) cplen = 4;
    else if((text[i] & 0xf0) == 0xe0) cplen = 3;
    else if((text[i] & 0xe0) == 0xc0) cplen = 2;
    if((i + cplen) > text.length()) cplen = 1;

    cout << text.substr(i, cplen) << endl;
    i += cplen;
}

With both solutions, however, be aware that glyphs made of multiple codepoints exist, as well as codepoints that can't be printed on their own.
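
If you need the pieces rather than printed output, the same loop can be wrapped in a function. This is only a sketch of that idea: split_code_points is a name made up here, and like the loop above it trusts the lead byte and does not validate the input.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits a UTF-8 string into one std::string per code point, using the
// same lead-byte checks as the loop above. Does not validate the input:
// bytes that do not look like a multi-byte lead byte become 1-byte pieces.
std::vector<std::string> split_code_points(const std::string& text) {
    std::vector<std::string> pieces;
    for (std::size_t i = 0; i < text.length();) {
        const unsigned char lead = static_cast<unsigned char>(text[i]);
        std::size_t cplen = 1;
        if ((lead & 0xf8) == 0xf0) cplen = 4;
        else if ((lead & 0xf0) == 0xe0) cplen = 3;
        else if ((lead & 0xe0) == 0xc0) cplen = 2;
        // Truncated sequence at the end of the string: fall back to 1 byte.
        if (i + cplen > text.length()) cplen = 1;

        pieces.push_back(text.substr(i, cplen));
        i += cplen;
    }
    return pieces;
}
```

Calling split_code_points on the string from the question should give seven pieces: four 3-byte strings for the Chinese characters, then "1", "2" and "3".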

edited Jul 08 '20 at 08:58 by hiddensunset4
answered Oct 15 '16 at 04:01 by deviantfan

That's great. Thanks a lot! I remember the lib documentation says that it first detects whether it's a valid cp or not. So maybe multi-cp glyphs will raise an error. What do you mean by cp's that can't be printed alone? What are they? – Hai Oct 15 '16 at 04:36
@Hai I don't think the lib can do what you think it can. Whether the CPs are valid or not is independent of their semantic meaning and possible combinations. ... A very simple example: the accented **á** (not the same as a) can either be a normal single codepoint for á, or first a regular a and then a codepoint which adds an accent to the previous codepoint (independent of what the previous one is). ... If you get the latter variant, your code will print a regular a first, and then something strange or nothing at all, depending on the text rendering of your environment. – deviantfan Oct 15 '16 at 04:52
If you want to work with such things, you really need ICU (and it gets a lot more complicated) – deviantfan Oct 15 '16 at 04:53
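
To make the á example concrete, here is a small sketch that counts codepoints in both spellings. count_code_points is a hypothetical helper, not a library function. The byte values are the UTF-8 encodings of U+00E1 (precomposed á) and of a plain 'a' followed by U+0301 (combining acute accent). It assumes valid UTF-8 and counts every byte that is not a continuation byte.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string by counting every byte that is not
// a continuation byte (bit pattern 10xxxxxx). Assumes the input is valid
// UTF-8; it says nothing about how many glyphs a renderer will draw.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if ((b & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

count_code_points("\xC3\xA1") is 1 and count_code_points("a\xCC\x81") is 2, although both strings are normally rendered as the same single glyph. A loop that prints one codepoint per line would split the second spelling in two.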
Now I mostly need to handle Chinese characters, but I may have to deal with umlauts or other diacritics in the future. Do you have any idea what functionality ICU offers? btw I used your code instead of calling the lib, since it's more concise :) – Hai Oct 15 '16 at 19:38
@deviantfan where did `0xf8` etc. come from? Is there any doc to share? – shellbye Aug 13 '18 at 07:03
@shellbye Look at the table in https://en.wikipedia.org/wiki/UTF-8#Description If you convert the hex values to bitmasks, you'll understand. E.g. the first if says that if the first 5 bits are 11110, then it's a length 4 tuple. If the first 4 bits are 1110, then cplen is 3. And so on. – deviantfan Aug 15 '18 at 09:56
@deviantfan Thank you for the detail. – shellbye Aug 16 '18 at 10:04
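
The bitmask explanation above can be written out with the bit patterns as comments. This is a sketch: sequence_length is a made-up name, and unlike the answer's loop it returns 0 for bytes that cannot start a sequence instead of treating them as length 1. It classifies by bit pattern only, so lead bytes that are never valid in UTF-8, such as 0xC0, still match a pattern.

```cpp
#include <cstddef>

// Returns how many bytes the UTF-8 sequence starting with `lead` occupies,
// judged only by the leading bits, or 0 if `lead` cannot start a sequence
// (a continuation byte 10xxxxxx or a byte of the form 11111xxx).
std::size_t sequence_length(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx, mask 10000000
    if ((lead & 0xe0) == 0xc0) return 2;  // 110xxxxx, mask 11100000
    if ((lead & 0xf0) == 0xe0) return 3;  // 1110xxxx, mask 11110000
    if ((lead & 0xf8) == 0xf0) return 4;  // 11110xxx, mask 11111000
    return 0;
}
```

For example, 0xE4 (the first byte of 今) is 11100100 in binary. Masking with 0xf0 keeps the top four bits, 1110, which equals 0xe0, so the sequence is 3 bytes long.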
When is `if((i + cplen) > text.length()) cplen = 1;` necessary...? – ghchoi Jul 30 '19 at 08:35
@GyuHyeonChoi It's mostly a precaution so that invalid input can't cause out-of-bounds bugs. – deviantfan Oct 15 '19 at 06:05
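
One case where the precaution matters: in the answer's loop, substr clamps the length by itself, but code that indexes the following bytes directly would read past the end of a truncated string. The sketch below decodes the codepoint value at a position to show this. decode_at is a hypothetical helper, returning U+FFFD for a truncated sequence is a choice made for this example, and it does not validate continuation bytes, overlong forms or surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decodes the code point starting at text[i] and advances i past it.
// Precondition: i < text.length().
// If the lead byte announces more bytes than remain, returns U+FFFD and
// advances i by one byte. Bytes that do not match a multi-byte lead
// pattern are returned unchanged as a 1-byte value.
std::uint32_t decode_at(const std::string& text, std::size_t& i) {
    const unsigned char lead = static_cast<unsigned char>(text[i]);
    std::size_t cplen = 1;
    std::uint32_t cp = lead;
    if ((lead & 0xf8) == 0xf0)      { cplen = 4; cp = lead & 0x07; }
    else if ((lead & 0xf0) == 0xe0) { cplen = 3; cp = lead & 0x0f; }
    else if ((lead & 0xe0) == 0xc0) { cplen = 2; cp = lead & 0x1f; }

    // The precaution from the answer: without this check, the loop below
    // would index past the end of a truncated string.
    if (i + cplen > text.length()) {
        ++i;
        return 0xFFFD;
    }

    // Each continuation byte contributes its low 6 bits.
    for (std::size_t k = 1; k < cplen; ++k) {
        cp = (cp << 6) | (static_cast<unsigned char>(text[i + k]) & 0x3f);
    }
    i += cplen;
    return cp;
}
```

With the two bytes "\xE4\xBA" (a 3-byte character cut off after its second byte), decode_at returns 0xFFFD and moves on by one byte instead of reading a third byte that is not there.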
