
In the following program, I'm trying to measure the length of a string with non-ASCII characters.

But I'm not sure why size() doesn't give the correct length for the non-ASCII string.

#include <iostream>
#include <string>

int main()
{
    std::string s1 = "Hello";
    std::string s2 = "इंडिया"; // non-ASCII string
    std::cout << "Size of " << s1 << " is " << s1.size() << std::endl;
    std::cout << "Size of " << s2 << " is " << s2.size() << std::endl;
}

Output:

Size of Hello is 5
Size of इंडिया is 18

Live demo Wandbox.

  • What do you mean by "correct output"? – Öö Tiib Oct 26 '17 at 06:43
  • I think the correct output for the string is 6. – msc Oct 26 '17 at 06:44
  • @rsp, you'll get size `6` with `std::wstring` – DAle Oct 26 '17 at 06:46
  • The length of a UTF-8 string is ambiguous (code points (with which representation), grapheme clusters, bytes). See http://utf8everywhere.org/#characters and http://utf8everywhere.org/#myth.strlen – Jarod42 Oct 26 '17 at 09:20
  • If you set your document's codepage to UTF-8 (the most common) in your editor, then every character is encoded in 1 to 4 bytes, so sizeof, strlen, etc. do not work correctly for non-ASCII characters: they are designed for ASCII characters, which are encoded in 7 bits (less than 0x7F). You can check the link below for counting the number of characters. http://www.fileformat.info/info/unicode/utf8.htm – Mahmoud Hosseinipour Oct 26 '17 at 14:46

2 Answers


std::string::size returns the length in bytes, not the number of characters. Your second string is stored in a Unicode encoding (UTF-8 here), so a single character may take several bytes: each of the 6 Devanagari code points in "इंडिया" occupies 3 bytes, hence 18. Note that the same applies to std::wstring::size, since it depends on the encoding (it returns the number of wide chars, not actual characters: with UTF-32 that count matches the code points one-to-one, but with UTF-16 characters outside the BMP take two wide chars; more in this answer).

To measure the actual length (in number of symbols) you need to know the encoding in order to split (and therefore count) the characters correctly. This answer may be helpful for UTF-8, for example (although the conversion facility it uses is deprecated since C++17).

Another option for UTF-8 is to count the number of leading bytes, i.e. the bytes that are not continuation bytes (credit to this other answer):

int utf8_length(const std::string& s) {
  int len = 0;
  for (unsigned char c : s)
      len += (c & 0xc0) != 0x80; // continuation bytes look like 10xxxxxx; count everything else
  return len;
}
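
A minimal usage sketch (assuming the utf8_length function above is in scope, and that the source and execution character sets are UTF-8, as in the question's demo):

#include <iostream>
#include <string>

int main()
{
    std::string s2 = "इंडिया"; // non-ASCII string from the question
    std::cout << s2.size() << std::endl;       // prints 18: bytes
    std::cout << utf8_length(s2) << std::endl; // prints 6: UTF-8 code points
}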
cbuchart
  • Note that the code point count might differ from the abstract character count. – Jarod42 Oct 26 '17 at 09:29
  • As a literal string, the second string uses the execution character encoding that the compiler was directed to emit ([-fexec-charset](https://gcc.gnu.org/onlinedocs/cpp/Invocation.html) or equivalent), with, yes, a likely default of UTF-8. – Tom Blodget Oct 26 '17 at 16:26

I have used the std::wstring_convert class and got the correct length of the string.

#include <string>
#include <iostream>
#include <codecvt>

int main()
{
    std::string s1 = "Hello";
    std::string s2 = "इंडिया"; // non-ASCII string
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> cn; // UTF-8 <-> UTF-32 converter
    auto sz = cn.from_bytes(s2).size(); // length of the UTF-32 string: one char32_t per code point
    std::cout << "Size of " << s2 << " is " << sz << std::endl;
}
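
Since the converted std::u32string holds one char32_t per code point, this should print (on a UTF-8 terminal, as in the question):

Size of इंडिया is 6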

Live demo Wandbox.

See here for more about std::wstring_convert (note that it, together with the <codecvt> header, is deprecated since C++17).

msc