rustyx as the right answer, presenting how UTF-8 characters are encoded.
This is definitely not trivial; however, if you use a library, working with UTF-8 becomes pretty easy. You just have to remember that most characters are not encoded in a single byte (actually, only 128 characters fit in one byte; all the others use 2 to 4 bytes, for a total of 1,112,064¹ possible characters).
Note that the UTF-8 encoding scheme was originally designed to support sequences of 1 to 6 bytes, but Unicode characters are limited to values between 0 and 0x10FFFF inclusive. This is why at most 4 bytes are required. (In the old days, there was no such restriction.)
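To make the 1 to 4 byte layout concrete, here is a minimal sketch of an encoder for a single code point. The function name `encode_utf8` is made up for this illustration (it is not part of libutf8), and returning an empty string for invalid input is just one possible convention.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8 (hypothetical helper for this sketch).
// Returns an empty string for values that cannot be encoded:
// UTF-16 surrogates (0xD800-0xDFFF) and anything above 0x10FFFF.
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if(cp < 0x80)
    {
        // 1 byte: 0xxxxxxx (7 bits of payload)
        out += static_cast<char>(cp);
    }
    else if(cp < 0x800)
    {
        // 2 bytes: 110xxxxx 10xxxxxx (11 bits of payload)
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if(cp < 0x10000)
    {
        if(cp >= 0xD800 && cp <= 0xDFFF)
        {
            return out; // surrogate, not a valid character on its own
        }
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx (16 bits of payload)
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if(cp <= 0x10FFFF)
    {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (21 bits of payload)
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For instance, `encode_utf8(0x20AC)` (the euro sign) produces the three bytes E2 82 AC, while an ASCII letter stays a single byte.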
So on my end, I wrote the libutf8 library, which has the ability to convert UTF-8 to UTF-16 and UTF-32 and vice versa. It also includes an iterator allowing you to iterate through a UTF-8 string one character at a time. You can read the character as a char32_t value, which supports any Unicode character.
Here is an example:
    std::string s = "some string...";
    for(libutf8::utf8_iterator it(s); it != s.end(); ++it)
    {
        char32_t c(*it);

        // here you can choose:
        if(c == libutf8::NOT_A_CHARACTER)
        {
            // handle error -- current character is not valid UTF-8
            break;
        }

        // -- or --
        if(it.bad())
        {
            // handle error -- current character is not valid UTF-8
            break;
        }

        // 'c' is valid, you can print it, etc.
        ...
    }
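If you are curious about what such an iterator has to do at each step, here is a rough sketch of decoding one character by hand. This is not the libutf8 implementation; `decode_utf8` and `INVALID_CHAR` are names invented for this example, and skipping a single byte on error is only one possible recovery strategy.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical sentinel for this sketch; not a libutf8 constant.
constexpr char32_t INVALID_CHAR = 0xFFFFFFFF;

// Decode the code point starting at s[pos] and advance pos past it.
// On malformed input, return INVALID_CHAR and advance by one byte only.
char32_t decode_utf8(std::string_view s, std::size_t & pos)
{
    if(pos >= s.size())
    {
        return INVALID_CHAR;
    }

    unsigned char const lead = static_cast<unsigned char>(s[pos]);
    std::size_t len = 0;
    char32_t cp = 0;
    if(lead < 0x80)                       { len = 1; cp = lead; }
    else if(lead >= 0xC2 && lead <= 0xDF) { len = 2; cp = lead & 0x1F; }
    else if(lead >= 0xE0 && lead <= 0xEF) { len = 3; cp = lead & 0x0F; }
    else if(lead >= 0xF0 && lead <= 0xF4) { len = 4; cp = lead & 0x07; }
    else
    {
        // stray continuation byte or a lead byte that is never valid
        ++pos;
        return INVALID_CHAR;
    }

    if(pos + len > s.size())
    {
        // sequence is truncated
        ++pos;
        return INVALID_CHAR;
    }

    for(std::size_t i = 1; i < len; ++i)
    {
        unsigned char const b = static_cast<unsigned char>(s[pos + i]);
        if((b & 0xC0) != 0x80)
        {
            // expected a continuation byte (10xxxxxx)
            ++pos;
            return INVALID_CHAR;
        }
        cp = (cp << 6) | (b & 0x3F);
    }

    // reject overlong forms, surrogates, and values above 0x10FFFF
    static constexpr char32_t min_for_len[] = { 0, 0, 0x80, 0x800, 0x10000 };
    if(cp < min_for_len[len]
    || (cp >= 0xD800 && cp <= 0xDFFF)
    || cp > 0x10FFFF)
    {
        ++pos;
        return INVALID_CHAR;
    }

    pos += len;
    return cp;
}
```

Calling it in a loop with `pos` starting at 0 until `pos` reaches `s.size()` gives you one `char32_t` per character, which is roughly the work an iterator hides from you.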
I also offer a reverse iterator.
The library also has other functions, such as u8length(), which computes the length of a UTF-8 string in characters (unlike strlen(), which counts bytes).
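As an illustration of the difference between bytes and characters, counting characters in valid UTF-8 can be sketched by skipping continuation bytes. The name `count_code_points` is hypothetical and this is not necessarily how u8length() is implemented; the sketch also assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string_view>

// Count code points by counting every byte that is NOT a continuation
// byte (10xxxxxx). Assumes the input is valid UTF-8; no validation is done.
constexpr std::size_t count_code_points(std::string_view s)
{
    std::size_t n = 0;
    for(char const ch : s)
    {
        if((static_cast<unsigned char>(ch) & 0xC0) != 0x80)
        {
            ++n;
        }
    }
    return n;
}
```

With the 6-byte string `"h\xC3\xA9llo"` ("héllo"), strlen() reports 6 whereas this function reports 5.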
Note 1: Since C++20, the language includes the char8_t type. This is distinct from the char type. It is always unsigned (unlike char, which some compilers treat as signed by default), but it is otherwise just a byte. In other words, it still requires you to know how to encode/decode UTF-8 properly.
Note 2: The C library offers many of these functions, but they work with any type of multi-byte encoding, meaning that if your console (locale) is not set to UTF-8, you are likely to get incorrect results. This is why I'd rather have my own library than rely on a setting the user can easily mess up. See for example mblen(3), mbrtowc(3), wcstombs(3), etc.
¹ The number 1,112,064 comes from (0x110000 - 0x800). The 0x800 is the number of UTF-16 surrogates, code points 0xD800 to 0xDFFF. The surrogates are only valid in UTF-16, where they are used in pairs to encode characters from 0x10000 to 0x10FFFF. These code points are invalid in UTF-8 and UTF-32. Note further that all code points ending in 0xFFFE or 0xFFFF (that is, 0xXXFFFE and 0xXXFFFF) are not considered valid characters either. However, they can safely be encoded in UTF-8 and UTF-32.
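The arithmetic in this footnote can be checked at compile time. The constant and function names below are made up for this sketch, and `is_noncharacter` only covers the 0xXXFFFE and 0xXXFFFF values mentioned above.

```cpp
// Hypothetical names used only for this illustration.
constexpr char32_t MAX_CODE_POINT  = 0x10FFFF;
constexpr char32_t FIRST_SURROGATE = 0xD800;
constexpr char32_t LAST_SURROGATE  = 0xDFFF;

constexpr unsigned long CODE_SPACE      = MAX_CODE_POINT + 1;                   // 0x110000
constexpr unsigned long SURROGATE_COUNT = LAST_SURROGATE - FIRST_SURROGATE + 1; // 0x800

// 0x110000 - 0x800 = 1,112,064 encodable characters
static_assert(CODE_SPACE - SURROGATE_COUNT == 1112064, "unexpected character count");

// True for any value that may legally appear in UTF-8 or UTF-32.
constexpr bool is_scalar_value(char32_t c)
{
    return c <= MAX_CODE_POINT
        && !(c >= FIRST_SURROGATE && c <= LAST_SURROGATE);
}

// True for the last two code points of each plane (0xXXFFFE and 0xXXFFFF).
constexpr bool is_noncharacter(char32_t c)
{
    return (c & 0xFFFE) == 0xFFFE;
}
```

Note that a value such as 0x1FFFE passes `is_scalar_value()` and fails only the `is_noncharacter()` test, which matches the point that these values can still be encoded safely.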