As explained in the comments, length() returns the number of bytes of your string, which is encoded in UTF-8. In this multibyte encoding, non-ASCII characters are encoded on 2 to 4 bytes, so your UTF-8 string length will appear larger than the real number of Unicode characters.
Solution 1
If you have many long strings, you can keep them in UTF-8. The UTF-8 encoding makes it relatively easy to spot the additional bytes of multibyte characters: they all start with 10xxxxxx in binary. So count the number of such continuation bytes and subtract that from the string length:
cout << "Bytes: " << s.length() << endl;
cout << "Unicode length " << (s.length() - count_if(s.begin(), s.end(), [](char c)->bool { return (c & 0xC0) == 0x80; })) << endl;
Solution 2
If more processing is needed than just counting the length, you could think of using wstring_convert::from_bytes() from the standard library to convert your string into a wstring. The length of that wstring should be what you expect. (Note that wstring_convert and codecvt_utf8 have been deprecated since C++17, but they are still widely available.)
wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> cv;
wstring w = cv.from_bytes(s);
cout << "Unicode length " << w.length() << endl;
Attention: on Linux, wstring is based on a 32-bit wchar_t, and one such wide char can hold any character of the Unicode set, so this is perfect. On Windows, however, wchar_t is only 16 bits, so some characters might still require a multi-word (surrogate pair) encoding. Fortunately, all the Hindi characters (the Devanagari block, U+0900 to U+097F) lie in the range U+0000 to U+D7FF, which can be encoded in one 16-bit word, so it should be OK here as well.
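If you ever need to be safe for characters outside that range on Windows too, a variant of the same idea is to convert to char32_t instead, so that each element is exactly one code point on every platform (codepoint_count below is just an illustrative helper name):

#include <codecvt>
#include <locale>
#include <string>
using namespace std;

// assumes the input holds valid UTF-8, as in the question
size_t codepoint_count(const string& s) {
    wstring_convert<codecvt_utf8<char32_t>, char32_t> cv;
    return cv.from_bytes(s).length();  // one char32_t per Unicode code point
}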