In general, nothing in the C++ standard library can count "characters" in a Unicode string. It isn't even clear what exactly you mean by "character" to begin with. The closest you can get is counting code points by using UTF-32 literals and `std::u32string`. However, that isn't going to match what you want even for `їa`.
For example, `ї` may be a single code point
ї CYRILLIC SMALL LETTER YI (U+0457)
or two consecutive code points
і CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I (U+0456)
◌̈ COMBINING DIAERESIS (U+0308)
If you don't know that the string is normalized, the standard library gives you no way to treat the two forms as the same character, and there is no way to force normalization either. Even for UTF-32 string literals, it is up to the implementation which of the two is chosen. Counting code points in `їa` will therefore give you either 2 or 3.
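To make the 2-versus-3 result concrete, here is a minimal sketch that builds both forms from explicit universal character names, so the outcome does not depend on how the source file is encoded. The function names `precomposed`, `decomposed`, and `count_code_points` are made up for this illustration.

```cpp
#include <cstddef>
#include <string>

// Precomposed form: U+0457 followed by 'a'.
// The literal is split so that 'a' cannot be misread as part of the escape.
std::u32string precomposed() {
    return std::u32string(U"\u0457" U"a");
}

// Decomposed form: U+0456, then U+0308 (combining diaeresis), then 'a'.
std::u32string decomposed() {
    return std::u32string(U"\u0456\u0308" U"a");
}

// For std::u32string, size() is the number of code points,
// which is not the number of user-perceived characters.
std::size_t count_code_points(const std::u32string& s) {
    return s.size();
}
```

Calling `count_code_points(precomposed())` gives 2 and `count_code_points(decomposed())` gives 3, and the two strings compare unequal with `==` even though they render the same way.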
And that isn't even considering the encoding issue that you mention in your question. Each code point may itself be encoded as multiple code units, depending on the chosen encoding, and `.size()` counts code units, not code points. With `std::u32string` the two will at least coincide, even if that doesn't help you, as demonstrated above.
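As a rough illustration of code units versus code points, the sketch below stores the same text in three encodings and reports `.size()` for each. It assumes that `char16_t` and `char32_t` literals are UTF-16 and UTF-32 (guaranteed since C++20, and true in practice on earlier compilers), and it spells out the UTF-8 bytes by hand to avoid depending on the execution character set. All function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// "їa" with ї as U+0457. In UTF-8, U+0457 is the two bytes 0xD1 0x97.
// The literal is split because "\x97a" would be parsed as one hex escape.
std::size_t utf8_units()  { return std::string("\xD1\x97" "a").size(); }
std::size_t utf16_units() { return std::u16string(u"\u0457" u"a").size(); }
std::size_t utf32_units() { return std::u32string(U"\u0457" U"a").size(); }

// A single code point outside the Basic Multilingual Plane (U+1F600):
// UTF-16 needs a surrogate pair, UTF-32 needs one unit.
std::size_t utf16_units_astral() { return std::u16string(u"\U0001F600").size(); }
std::size_t utf32_units_astral() { return std::u32string(U"\U0001F600").size(); }
```

Here `utf8_units()` is 3, while `utf16_units()` and `utf32_units()` are both 2; for U+1F600, `utf16_units_astral()` is 2 and `utf32_units_astral()` is 1. Only the UTF-32 counts match the number of code points in every case.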
You need a Unicode library such as ICU if you want to do this properly.