A u8string is effectively a sequence of bytes as far as most C++ functions are concerned. As such, size() gives you 13 (48 65 6c 6c 6f f0 9f 98 83 f0 9f 98 83), with each "😃" ("SMILING FACE WITH OPEN MOUTH", U+1F603) encoded as 4 elements, f0 9f 98 83. You will see the same with [i], substr, etc.
Knowing that it is UTF-8, you can count the number of Unicode code points. One option is to convert to a u32string, where each element is a single code point. I don't believe the C++ standard library has a function to count code points directly on a u8string out of the box, but it is easy to write one:
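    #include <cstddef>
    #include <string>

    // Counts code points in UTF-8 text (assumed well-formed) by counting
    // every byte that is not a continuation byte (10xxxxxx).
    std::size_t count_codepoints(const std::u8string &str)
    {
        std::size_t count = 0;
        for (char8_t c : str)
            if ((c & 0b1100'0000) != 0b1000'0000) // Not a trailing byte
                ++count;
        return count;
    }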
However, this may still not be what people think of as the "number of characters". Multiple code points may be used to represent a single visible character, as with the "combining characters". Some of these also have "precomposed" forms, and the order of the combining code points can vary, which is why the Unicode "normal forms" exist and why comparing Unicode strings is tricky. For example, "Á" might be the single code point "LATIN CAPITAL LETTER A WITH ACUTE" (U+00C1), which is C3 81 in UTF-8, or it might be a plain "A" followed by "COMBINING ACUTE ACCENT" (U+0301), which is two code points and 3 UTF-8 bytes, 41 CC 81.
unicode.org publishes tables for each Unicode version that let you properly handle and convert the combining characters (and things like upper/lower case conversion), but they are pretty extensive and you would need to write some code to use them. Third-party libraries (I think Linux mostly uses ICU) or OS functions (Windows has a number of APIs) also provide various utilities.
It's worth noting that you can run into these issues in many other languages and environments, not just C++. For example, JavaScript, Java and .NET, along with the Windows C/C++ API (essentially wchar_t on Windows), use UTF-16 strings, which need "surrogate pairs" for code points above U+FFFF, and many functions actually count UTF-16 elements, not code points.
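The same effect can be reproduced in C++ with a u16string. The sketch below (the helper name count_codepoints_utf16 is made up, and the input is assumed to be well-formed UTF-16) counts code points by skipping the second half of each surrogate pair.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in UTF-16 text.
// ASSUMPTION: well-formed input, so every low surrogate (DC00..DFFF)
// is the second half of a pair and is not counted on its own.
std::size_t count_codepoints_utf16(const std::u16string &str)
{
    std::size_t count = 0;
    for (char16_t c : str)
        if (c < 0xDC00 || c > 0xDFFF) // Not a low (trailing) surrogate
            ++count;
    return count;
}

// Hypothetical demo: size() counts UTF-16 elements, not code points.
void demo_utf16()
{
    const std::u16string s = u"Hello\U0001F603\U0001F603";

    assert(s.size() == 9);                     // 5 + 2 * 2 elements
    assert(s[5] == 0xD83D && s[6] == 0xDE03);  // surrogate pair for U+1F603
    assert(count_codepoints_utf16(s) == 7);
}
```

This mirrors the UTF-8 case: the container's length is in encoding units, and code points have to be counted separately.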