Unable to work with utf8 character in c++

Question

#include <iostream>
#include <string>
#include <stdio.h>


using namespace std;
int main(){
    string str = "∑カ[キ…クケコ°サシÀスセÏÔÎソタ]—チツテトÃナニヌÊネノЖИѠѬѰѪᐂᑧᐫᐑᕓᕩᘷᙈᏍsᏜᎹ᳐盘的";
    cout << "--> String: " << str << endl;
    cout<<"--> Size str1: "<<str.size()<<endl;   
    
    for(unsigned ii=0; ii<=str.size();++ii)
    { 
        cout <<"--> ii: "<<ii<< " --> Character: "<< str[ii] <<endl;
    }
}

I'm using the ConEmu console with the chcp 65001 setting (UTF-8), and everything works fine when displaying the whole string str.

But when I try to print each individual character of str, the output is garbled.

Can anybody tell me how to work with individual characters?

Try to use `wcout` and `wstring`, and add prefix `L` before string literal? — con ko, Apr 01 '20 at 16:20
@JohnDing Please absolutely don’t. `std::wstring` should be relegated to history and never touched except to interface with legacy API. — Konrad Rudolph, Apr 01 '20 at 16:24
@KonradRudolph ahh, I just wanted to give a quick fix in the comment section. I know using `u16string` and `u32string` would be better. However, it's hard to find suitable IO operations for them. As for `u8string`, I don't think C++20 is an option. — con ko, Apr 01 '20 at 16:28
Also, `for(unsigned ii=0; ii<=str.size();++ii)` out of bounds. — Ashwani, Apr 01 '20 at 16:29
@KonradRudolph Of course, your suggestion is absolutely correct. — con ko, Apr 01 '20 at 16:29
@JohnDing that won't help! On Windows it's UTF-16 so you can't print characters outside the BMP if you print the individual code units — phuclv, Apr 02 '20 at 10:32

rustyx · Answer 1 · 2020-04-02T10:13:20.247

UTF-8 uses between 1 and 4 bytes to encode a single character.

So you can decode it by reading as many bytes as needed based on the value of the first byte:

0xxxxxxx - 1 byte
110xxxxx - 2 bytes
1110xxxx - 3 bytes
11110xxx - 4 bytes

(notice some gaps between these values - those are invalid UTF-8 values)

For example like this:

#include <iomanip>
#include <iostream>
#include <string>
#include <stdio.h>

using namespace std;
int main() {
    string str = "∑カ[キ…クケコ°サシÀスセÏÔÎソタ]—チツテトÃナニヌÊネノЖИѠѬѰѪᐂᑧᐫᐑᕓᕩᘷᙈᏍsᏜᎹ᳐盘的";
    cout << "--> String: " << str << endl;
    cout << "--> Size str1: " << str.size() << endl;

    string buf;
    int i = 0, count = 0;
    for (unsigned char c : str)
    {
        if (count == 0) {
            buf = c;
            if (c >= 0xF0)
                count = 3;
            else if (c >= 0xE0)
                count = 2;
            else if (c >= 0xC0)
                count = 1;
        } else {
            buf += c;
            --count;
        }
        if (count > 0)
            continue;
        cout << "--> ii: " << i++ << " --> Character: " << buf;
        cout << "  UTF-8 bytes:";
        for (unsigned char b : buf) {
            cout << " " << uppercase << hex << setfill('0') << setw(2) << (int)b;
        }
        cout << endl;
    }
}

Output:

--> String: ∑カ[キ…クケコ°サシÀスセÏÔÎソタ]—チツテトÃナニヌÊネノЖИѠѬѰѪᐂᑧᐫᐑᕓᕩᘷᙈᏍsᏜᎹ᳐盘的
--> Size str1: 140
--> ii: 0 --> Character: ∑  UTF-8 bytes: E2 88 91
--> ii: 1 --> Character: カ  UTF-8 bytes: E3 82 AB
--> ii: 2 --> Character: [  UTF-8 bytes: 5B
--> ii: 3 --> Character: キ  UTF-8 bytes: E3 82 AD
--> ii: 4 --> Character: …  UTF-8 bytes: E2 80 A6
--> ii: 5 --> Character: ク  UTF-8 bytes: E3 82 AF
--> ii: 6 --> Character: ケ  UTF-8 bytes: E3 82 B1
--> ii: 7 --> Character: コ  UTF-8 bytes: E3 82 B3
--> ii: 8 --> Character: °  UTF-8 bytes: C2 B0
--> ii: 9 --> Character: サ  UTF-8 bytes: E3 82 B5
--> ii: A --> Character: シ  UTF-8 bytes: E3 82 B7
--> ii: B --> Character: À  UTF-8 bytes: C3 80
--> ii: C --> Character: ス  UTF-8 bytes: E3 82 B9
--> ii: D --> Character: セ  UTF-8 bytes: E3 82 BB
--> ii: E --> Character: Ï  UTF-8 bytes: C3 8F
--> ii: F --> Character: Ô  UTF-8 bytes: C3 94
--> ii: 10 --> Character: Î  UTF-8 bytes: C3 8E
--> ii: 11 --> Character: ソ  UTF-8 bytes: E3 82 BD
--> ii: 12 --> Character: タ  UTF-8 bytes: E3 82 BF
--> ii: 13 --> Character: ]  UTF-8 bytes: 5D
--> ii: 14 --> Character: —  UTF-8 bytes: E2 80 94
--> ii: 15 --> Character: チ  UTF-8 bytes: E3 83 81
--> ii: 16 --> Character: ツ  UTF-8 bytes: E3 83 84
--> ii: 17 --> Character: テ  UTF-8 bytes: E3 83 86
--> ii: 18 --> Character: ト  UTF-8 bytes: E3 83 88
--> ii: 19 --> Character: Ã  UTF-8 bytes: C3 83
--> ii: 1A --> Character: ナ  UTF-8 bytes: E3 83 8A
--> ii: 1B --> Character: ニ  UTF-8 bytes: E3 83 8B
--> ii: 1C --> Character: ヌ  UTF-8 bytes: E3 83 8C
--> ii: 1D --> Character: Ê  UTF-8 bytes: C3 8A
--> ii: 1E --> Character: ネ  UTF-8 bytes: E3 83 8D
--> ii: 1F --> Character: ノ  UTF-8 bytes: E3 83 8E
--> ii: 20 --> Character: Ж  UTF-8 bytes: D0 96
--> ii: 21 --> Character: И  UTF-8 bytes: D0 98
--> ii: 22 --> Character: Ѡ  UTF-8 bytes: D1 A0
--> ii: 23 --> Character: Ѭ  UTF-8 bytes: D1 AC
--> ii: 24 --> Character: Ѱ  UTF-8 bytes: D1 B0
--> ii: 25 --> Character: Ѫ  UTF-8 bytes: D1 AA
--> ii: 26 --> Character: ᐂ  UTF-8 bytes: E1 90 82
--> ii: 27 --> Character: ᑧ  UTF-8 bytes: E1 91 A7
--> ii: 28 --> Character: ᐫ  UTF-8 bytes: E1 90 AB
--> ii: 29 --> Character: ᐑ  UTF-8 bytes: E1 90 91
--> ii: 2A --> Character: ᕓ  UTF-8 bytes: E1 95 93
--> ii: 2B --> Character: ᕩ  UTF-8 bytes: E1 95 A9
--> ii: 2C --> Character: ᘷ  UTF-8 bytes: E1 98 B7
--> ii: 2D --> Character: ᙈ  UTF-8 bytes: E1 99 88
--> ii: 2E --> Character: Ꮝ  UTF-8 bytes: E1 8F 8D
--> ii: 2F --> Character: s  UTF-8 bytes: 73
--> ii: 30 --> Character: Ꮬ  UTF-8 bytes: E1 8F 9C
--> ii: 31 --> Character: Ꮉ  UTF-8 bytes: E1 8E B9
--> ii: 32 --> Character: ᳐  UTF-8 bytes: E1 B3 90
--> ii: 33 --> Character: 盘  UTF-8 bytes: E7 9B 98
--> ii: 34 --> Character: 的  UTF-8 bytes: E7 9A 84

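As you can see, each Unicode code point in this string is encoded in UTF-8 using 1, 2 or 3 bytes (a char is exactly 1 byte, so it holds a single UTF-8 code unit, not a whole multi-byte character).

This can be inconvenient if you want to work with individual Unicode symbols as a unit. In this case you can convert the string to a wstring and work with the wide-char type (wchar_t) instead of char. Keep in mind that wchar_t is only 16 bits wide on Windows, so code points outside the Basic Multilingual Plane still take two units there.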

See the following link to the question about how to convert a string to a wstring.
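If the width of wchar_t on a given platform is a concern, one alternative is to decode into a std::u32string, where every element is one code point. The following is only a rough sketch: `decode_utf8` and `kReplacement` are hypothetical names, not standard library functions, and the decoder replaces obviously malformed bytes with U+FFFD but does not reject overlong forms or surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumption: malformed input is reported as U+FFFD (the replacement character).
constexpr char32_t kReplacement = 0xFFFD;

// Hypothetical helper: decodes UTF-8 bytes into one char32_t per code point.
// Overlong encodings and surrogates are NOT rejected in this sketch.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i++]);
        char32_t cp = 0;
        int extra = 0;  // number of continuation bytes that must follow
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(kReplacement); continue; }  // stray or invalid lead byte

        bool ok = true;
        for (int k = 0; k < extra; ++k) {
            if (i >= in.size()) { ok = false; break; }            // truncated sequence
            const unsigned char b = static_cast<unsigned char>(in[i]);
            if ((b & 0xC0) != 0x80) { ok = false; break; }        // not 10xxxxxx
            cp = (cp << 6) | (b & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : kReplacement);
    }
    assert(out.size() <= in.size());  // never more code points than bytes
    return out;
}
```

After decoding, indexing such as `decoded[3]` addresses the fourth code point directly, which is not possible with the byte-oriented std::string.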

It's working for display, but do you think it will work by using the find method ? like finding a character in the str ? — Gilles06, Apr 01 '20 at 17:16
Yes, you can search for a substring consisting of bytes of a single Unicode code point. This works because UTF-8 is self-synchronizing i.e. it's impossible to find a code point at the middle of another code point. What you cannot do is select Unicode characters randomly by offset, you can only do that sequentially since the position of a UTF-8 character depends on the size of all previous characters in the string. — rustyx, Apr 01 '20 at 18:22
Thanks. But I still don't understand why the display is wrong when I print str after calling "shuffle (str.begin(), str.end(), default_random_engine(seed));". Are the byte values affected by the shuffle? — Gilles06, Apr 02 '20 at 09:45
A UTF-8 encoded symbol consists of multiple bytes. shuffle breaks that because it moves individual bytes around. See my updated answer for more details. — rustyx, Apr 02 '20 at 10:13
Thanks rustyx. Does that mean C++ is not able to manipulate Unicode easily? Python 3.4 does it. — Gilles06, Apr 03 '20 at 11:11
rustyx: If I understood your solution correctly, I have to isolate each character and work with it as a substring at the byte level? If so, it's very complicated. — Gilles06, Apr 03 '20 at 13:40
Yes, C++ is much more low-level than Python and requires full understanding of how strings are represented in memory. But as I said, consider trying out `wstring`, it can simplify working with Unicode. — rustyx, Apr 04 '20 at 12:12
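
The shuffle problem discussed in these comments can be sketched as follows, assuming well-formed UTF-8 input. `split_code_points` and `shuffle_code_points` are hypothetical helper names used only for illustration. The idea is to cut the string into one substring per code point first and shuffle those substrings, so that the bytes of a multi-byte character always stay together.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Hypothetical helper: splits a UTF-8 string into one substring per code point.
// Assumes well-formed input; no validation is done.
std::vector<std::string> split_code_points(const std::string& s) {
    std::vector<std::string> out;
    for (unsigned char c : s) {
        // A byte of the form 10xxxxxx continues the current code point;
        // any other byte starts a new one.
        if ((c & 0xC0) != 0x80 || out.empty())
            out.emplace_back();
        out.back() += static_cast<char>(c);
    }
    return out;
}

// Hypothetical helper: shuffles whole code points instead of individual bytes.
std::string shuffle_code_points(const std::string& s, unsigned seed) {
    std::vector<std::string> parts = split_code_points(s);
    std::shuffle(parts.begin(), parts.end(), std::default_random_engine(seed));
    std::string out;
    for (const std::string& p : parts)
        out += p;
    assert(out.size() == s.size());  // same bytes, only regrouped
    return out;
}
```

Searching needs no such preparation: `str.find("盘")` on the original std::string works as is, but the value it returns is a byte offset, not a character index. Note also that a code point is not always what a user perceives as one character (combining marks are separate code points), so this sketch can still separate a base letter from its accent.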

score 0 · Answer 2 · answered Apr 01 '20 at 16:25

"Can anybody tell me how to work with individual characters?"

By following the Unicode specification.

Each char object in a UTF-8 encoded C++ string holds one code unit, not necessarily a whole character. Inserting other code units between the code units that make up a single character breaks the encoding.

There is no standard C++ function to iterate over Unicode characters.

answered Apr 01 '20 at 16:25

eerorika

Note that C++20 has a `char8_t` type now. It doesn't help in any way other than the fact that it is always unsigned by default (in all compilers) which makes it a little easier to work with than `char`. But since all existing functions expect `char`... it's not what I'd call practical. – Alexis Wilke May 28 '23 at 16:16

Alexis Wilke · Answer 3 · 2023-05-28T16:14:41.427

rustyx has the right answer, presenting how UTF-8 characters are encoded.

This is definitely not trivial. However, if you use a library, working with UTF-8 can become pretty easy. You just have to remember that most characters are not encoded in a single byte (only the 128 ASCII characters are; all the others use 2 to 4 bytes, for a total of 1,112,064¹ possible characters).

Note that the original UTF-8 design allowed sequences of up to 6 bytes, but Unicode code points are limited to the range 0 to 0x10FFFF inclusive, which is why at most 4 bytes are needed today. (In the old days, there was no such restriction.)
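
To illustrate why 4 bytes are enough, here is a minimal sketch of the encoding direction. `encode_utf8` is a hypothetical name chosen for this example (it is not the libutf8 API); it returns an empty string for surrogates and for values above 0x10FFFF.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encodes one Unicode scalar value as UTF-8.
// Returns an empty string for surrogates and for values above 0x10FFFF.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return out;
    if (cp < 0x80) {                 // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    assert(out.size() >= 1 && out.size() <= 4);
    return out;
}
```

The largest code point, 0x10FFFF, lands in the last branch and yields the four bytes F4 8F BF BF, so longer sequences are never needed.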

So on my end, I wrote the libutf8 library, which has the ability to convert UTF-8 to UTF-16 and UTF-32 and vice versa. It also includes an iterator allowing you to iterate through a UTF-8 string one character at a time. You can read the character as a char32_t value which supports any Unicode character.

Here is an example:

std::string s = "some string...";

for(libutf8::utf8_iterator it(s); it != s.end(); ++it)
{
    char32_t c(*it);

    // here you can choose:
    if(c == libutf8::NOT_A_CHARACTER)
    {
        // handle error -- current character is not valid UTF-8
        break;
    }
    // -- or --
    if(it.bad())
    {
        // handle error -- current character is not valid UTF-8
        break;
    }

    // 'c' is valid, you can print it, etc.
    ...
}

I also offer a reverse iterator.

The library also has other functions such as the u8length() to compute the length of the UTF-8 string in characters (instead of the strlen() which counts the bytes).
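
For readers who only need such a length, the idea can be sketched without any library. The function below is a hypothetical stand-in named `count_code_points`; it is not libutf8's implementation and it assumes well-formed UTF-8. It relies on the fact that every code point has exactly one byte that is not a continuation byte (10xxxxxx).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in (not libutf8's u8length()): counts code points by
// counting every byte that is not a continuation byte of the form 10xxxxxx.
// Assumes the input is well-formed UTF-8.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80)
            ++n;
    }
    assert(n <= s.size());  // a code point takes at least one byte
    return n;
}
```

Going by the output listed in rustyx's answer, the string from the question has 140 bytes but only 53 code points (indices 0 to 0x34), which is the kind of difference such a function makes visible.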

Note 1: Since C++20, the language includes the char8_t type. This is distinct from the char type. It is always unsigned (contrary to char, which some compilers treat as signed by default), but it is otherwise just a byte. In other words, it still requires you to know how to encode/decode UTF-8 properly.

Note 2: The C library offers many of these functions, which work with any type of multi-byte encoding... meaning that if your console (locale) is not set to UTF-8, you are likely to not get the correct results. This is why I'd rather have my own library and use that and not rely on a parameter the user can easily mess up. See for example mblen(3), mbrtowc(3), wcstombs(3), etc.

¹ The number 1,112,064 comes from (0x110000 - 0x800). The 0x800 comes from the UTF-16 surrogates, code points 0xD800 to 0xDFFF. The surrogates are only valid in UTF-16 and are used to encode characters from 0x10000 to 0x10FFFF. These code points are invalid in UTF-8 and UTF-32. Note further that all code points of the form 0xXXFFFE and 0xXXFFFF are not considered valid characters either. However, they can safely be encoded in UTF-8 and UTF-32.
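
The arithmetic in this footnote can be checked at compile time. The constant names below are made up for this illustration.

```cpp
#include <cstdint>

// Illustrative constants (names are assumptions made for this sketch).
constexpr std::uint32_t kCodePoints   = 0x110000;             // 0 .. 0x10FFFF
constexpr std::uint32_t kSurrogates   = 0xDFFF - 0xD800 + 1;  // UTF-16 surrogate range
constexpr std::uint32_t kScalarValues = kCodePoints - kSurrogates;

static_assert(kSurrogates == 0x800, "the surrogate range holds 2048 code points");
static_assert(kScalarValues == 1112064, "0x110000 - 0x800 == 1,112,064");
```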
