
Can I count the number of "characters" that a std::string contains, and not the number of bytes? For instance, std::string::size and std::string::length return the number of bytes (chars):

std::string m_string1 {"a"};
// This is 1
m_string1.size();

std::string m_string2 {"їa"};
// This is 3 because of Unicode
m_string2.size();

Is there a way to get the number of characters? For instance, to find out that m_string2 has 2 characters.

jcjuarez
  • Unfortunately, this is one part of the C++ library which is deficient and/or is cumbersome to use. One needs to use `std::locale` to convert the string to a `std::wstring`, and then roll the dice again. – Sam Varshavchik Feb 19 '23 at 21:30
  • @SamVarshavchik: wstring wouldn't help either, as what a "character" is depends on what you mean by that. Multiple codepoints can form a single "character". – Nicol Bolas Feb 19 '23 at 21:31
  • You need a parser that can parse the actual encoding (probably UTF-8) and give you the code points. – Some programmer dude Feb 19 '23 at 21:33
  • It could, if Unicode characters include combining marks. Hence "roll the dice" is how I hedged my bets. – Sam Varshavchik Feb 19 '23 at 21:34
  • Possible duplicate https://stackoverflow.com/questions/43302279/any-good-solutions-for-c-string-code-point-and-code-unit/43302460#43302460 – Galik Feb 19 '23 at 21:41
  • Sometimes, it seems C++ isn't even aware that Unicode exists. You will most likely need an external library. – Etienne de Martel Feb 19 '23 at 21:55
  • I think to just quickly count code points in UTF-8, you could count the values where the `unsigned char` value `c` satisfies `(c < 0x80) || (c >= 0xc0)`. But as noted in an answer, "code point" might still not be what you mean by "character". – aschepler Feb 19 '23 at 22:43
  • There are many ways to define what a "character" is. Why do you need to count characters? Perhaps knowing this will help you select the relevant definition and then an appropriate method of counting. – n. m. could be an AI Feb 19 '23 at 22:55
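
The byte test suggested in the comments can be written out as a short function. This is only a sketch: the name `count_code_points_utf8` is made up for illustration, it assumes the string holds well-formed UTF-8, and it counts code points, which may still not be what you mean by "character".

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 code points by counting every byte that
// is NOT a continuation byte. Continuation bytes have the bit pattern
// 10xxxxxx, so this is the same test as (c < 0x80) || (c >= 0xC0).
// Assumption: `s` is well-formed UTF-8; no validation is done here.
std::size_t count_code_points_utf8(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the bytes of a precomposed "їa" (0xD1 0x97 followed by 'a') this returns 2, while .size() on the same string returns 3.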

1 Answer


In general, nothing in the C++ standard library can count "characters" in a Unicode string. It isn't clear what exactly you mean by "character" to begin with, and the closest you can get is counting code points by using UTF-32 literals and std::u32string. However, that isn't going to match what you want, even for їa.

For example ї may be a single code point

ї CYRILLIC SMALL LETTER YI (U+0457)

or two consecutive code points

і CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I (U+0456)
◌̈ COMBINING DIAERESIS (U+0308)

If you don't know that the string is normalized, then you can't distinguish the two with the standard library and there is no way to force normalization either. Even for UTF-32 string literals it is up to the implementation which one is chosen. You will get 2 or 3 for a string їa when counting code points.
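
To make the 2-versus-3 outcome concrete, here is a small sketch that spells both forms with explicit escapes, so the result does not depend on how the source file or the compiler represents ї. The variable names are only illustrative.

```cpp
#include <string>

// Precomposed form: U+0457 followed by 'a'.
// The literal is split in two only so the escape and the 'a' are easy to tell apart.
const std::u32string yi_a_precomposed = U"\u0457" U"a";

// Decomposed form: U+0456 + U+0308 (combining diaeresis) followed by 'a'.
const std::u32string yi_a_decomposed = U"\u0456\u0308" U"a";

// yi_a_precomposed.size() is 2 and yi_a_decomposed.size() is 3, although both
// are meant to display as the same two "characters". The two strings also
// compare unequal, because the standard library does no normalization.
```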

And that isn't even considering the encoding issue that you mention in your question. Each code point itself may be encoded into multiple code units depending on the chosen encoding and .size() is counting code units, not code points. With std::u32string these two will at least coincide, even if it doesn't help you as I demonstrate above.
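
The code unit versus code point distinction can be seen by storing the same text in three encodings. This is a sketch with made-up variable names. The UTF-8 bytes are written out by hand to avoid depending on the source file encoding, and the emoji U+1F600 is an extra example, not from the question, added only to show that UTF-16 can also need more than one code unit per code point.

```cpp
#include <string>

// "їa" with the precomposed U+0457, in three encodings.
// UTF-8 bytes are spelled out; the literal is split so that 'a' is not
// parsed as part of the \x escape.
const std::string    yi_a_utf8  = "\xD1\x97" "a";  // size() == 3 code units (bytes)
const std::u16string yi_a_utf16 = u"\u0457" u"a";  // size() == 2 code units
const std::u32string yi_a_utf32 = U"\u0457" U"a";  // size() == 2 code units == 2 code points

// A single code point outside the Basic Multilingual Plane (U+1F600).
const std::string    emoji_utf8  = "\xF0\x9F\x98\x80";  // size() == 4
const std::u16string emoji_utf16 = u"\U0001F600";       // size() == 2 (surrogate pair)
const std::u32string emoji_utf32 = U"\U0001F600";       // size() == 1
```

Only in the UTF-32 strings does .size() equal the number of code points.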

You need a Unicode library such as ICU if you want to do this properly.

user17732522