The problem is that UTF-8 (which is an *encoding* of Unicode, not Unicode itself) is a multi-byte character encoding. The most common characters (the ASCII set) use a single byte, but less common ones (notably emoji) can use up to 4 bytes. And that is far from being the only problem.
If you only use characters from the Basic Multilingual Plane, and can be sure never to encounter combining characters, you can safely use `std::wstring` and `wchar_t`, because a `wchar_t` is guaranteed to hold any code point from the BMP.
But in the general case, Unicode is a mess. Even with `char32_t`, which can hold any Unicode code point, you cannot assume a bijection between code points and graphemes (displayed characters). For example, LATIN SMALL LETTER E WITH ACUTE (é) is the single code point U+00E9, but it can also be represented in decomposed form as U+0065 U+0301, that is, LATIN SMALL LETTER E followed by a COMBINING ACUTE ACCENT. So even with `char32_t` you can get 2 code points for one single grapheme, and it would be incorrect to split them:
    char32_t eacute[] = { U'e', 0x0301, 0 };  // decomposed: 'e' + combining acute
This is indeed a representation of é. You can copy and paste it to verify that it is not the precomposed U+00E9 character but the decomposed form, yet once rendered the two are indistinguishable.
TL;DR: Unless you are sure you only use a subset of Unicode that could also be represented in a much smaller fixed-width charset such as ISO-8859-1 (Latin-1), you have no simple way to split a string into true characters (graphemes).