5

With the following string, the size is incorrectly output. Why is this, and how can I fix it?

string str = " ██████";
cout << str.size();
// outputs 19 rather than 7

I'm trying to loop through str character by character so I can read it into a vector<string> which should have a size of 7, but I can't do this since the above code outputs 19.

phuclv
k-a-v
  • You may want to use a `wstring`, not a `string`. – Jean-Baptiste Yunès Oct 19 '19 at 15:32
  • @Jean-BaptisteYunès that won't work if `wstring` uses UTF-16 and the characters are outside the BMP. It also won't work for many characters with diacritics, since [some are precomposed and some can be combined](https://en.wikipedia.org/wiki/Precomposed_character): Å = U+00C5, or U+0041 plus U+030A, and without proper normalization calling `length` may return 1 or 2 rather than the consistent number one expects – phuclv Oct 19 '19 at 15:53
  • @phuclv just have a look at the accepted answer. – Jean-Baptiste Yunès Oct 20 '19 at 08:37
  • @Jean-BaptisteYunès I already saw that yesterday, and it's incorrect if what the OP wants is the number of characters – phuclv Oct 20 '19 at 12:41
  • @ToasterFrogs you need to explain why you expect it to `have a size of 6`. Are you interested in the **number of visible characters**? That's not what developers want in most situations. And even in that case you must also choose one normalization, since Å can result in 1 or 2 characters depending on the viewpoint, and an emoji family sequence joined with zero-width joiners actually consists of 7 characters under the hood. Are you only interested in strings with █ or with any other Unicode characters? – phuclv Oct 20 '19 at 12:56
  • @phuclv I mistyped. I am looking for an output of 7 - the post has been edited. I am interested in using other Unicode characters such as `▖`, `▒`, `◢`, and possibly other box drawing characters. I am looking for the number of characters in the string - in the cases you mentioned, I would like `1` to be returned. – k-a-v Oct 20 '19 at 13:06

2 Answers

9

TL;DR

The size() and length() members of basic_string return the size in code units of the underlying string type, not the number of visible characters. To get the expected number:

  • Use UTF-16 with the u prefix for very simple strings that contain no non-BMP, combining or joining characters
  • Use UTF-32 with the U prefix for very simple strings that don't contain any combining or joining characters
  • Normalize the string and count for arbitrary Unicode strings

" ██████" is a space followed by a series of 6 U+2588 characters. Your compiler seems to be using UTF-8 for std::string. UTF-8 is a variable-length encoding and many letters are encoded using multiple bytes (because obviously you can't encode more than 256 characters with just one byte). In UTF-8 code points between U+0800 and U+FFFF are encoded by 3 bytes. Therefore the length of the the string in UTF-8 is 1 + 6*3 = 19 bytes.

You can check with any Unicode converter and see that the string is encoded as 20 E2 96 88 E2 96 88 E2 96 88 E2 96 88 E2 96 88 E2 96 88 in UTF-8, and you can also loop through each byte of your string to check, as in the sketch below.
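A minimal sketch, assuming the literal really is stored as UTF-8 (as it appears to be with your compiler): it dumps each byte in hex and then counts code points by skipping UTF-8 continuation bytes (bytes of the form 10xxxxxx):

#include <cstdio>
#include <string>

int main() {
    std::string str = " ██████";

    // Dump the raw bytes of the string in hex
    for (unsigned char c : str)
        std::printf("%02X ", static_cast<unsigned>(c)); // 20 E2 96 88 E2 96 88 ...
    std::printf("\n%zu bytes\n", str.size());           // 19 bytes

    // Count code points: each code point starts with exactly one
    // non-continuation byte (anything except 10xxxxxx)
    std::size_t codePoints = 0;
    for (unsigned char c : str)
        if ((c & 0xC0) != 0x80)
            ++codePoints;
    std::printf("%zu code points\n", codePoints);        // 7
}

Note that this counts code points, not visible characters, so it gives 7 here but would still over-count combined sequences like the "café" example below.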

If you want the total number of visible characters in the string then it's a lot trickier, and churill's solution doesn't work. Read the example from Twitter:

If you use anything beyond the most basic letters, numbers, and punctuation the situation gets more confusing. While many people use multi-byte Kanji characters to exemplify these issues, Twitter has found that accented vowels cause the most confusion because English speakers simply expect them to work. Take the following example: the word “café”. It turns out there are two byte sequences that look exactly the same, but use a different number of bytes:

café  0x63 0x61 0x66 0xC3 0xA9        Using the “é” character, called the “composed character”.
café  0x63 0x61 0x66 0x65 0xCC 0x81   Using the combining diacritical, which overlaps the “e”

You need a Unicode library like ICU to normalize the string and then count. Twitter, for example, uses Normalization Form C.
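A minimal sketch of that approach, assuming ICU is installed and linked: it normalizes the decomposed form of "café" to NFC and counts code points with countChar32(), which is the counting scheme Twitter describes further down:

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <iostream>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
    if (U_FAILURE(status)) return 1;

    // "cafe" followed by U+0301 COMBINING ACUTE ACCENT (0xCC 0x81 in UTF-8)
    icu::UnicodeString decomposed = icu::UnicodeString::fromUTF8("cafe\xCC\x81");
    icu::UnicodeString composed   = nfc->normalize(decomposed, status);
    if (U_FAILURE(status)) return 1;

    std::cout << decomposed.countChar32() << '\n'; // 5 code points
    std::cout << composed.countChar32()   << '\n'; // 4 code points: e + U+0301 fused into U+00E9
}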

Edit:

Since you're only interested in box-drawing characters, which don't seem to lie outside the BMP and don't contain any combining characters, UTF-16 and UTF-32 will both work. Like std::string, std::wstring is also a basic_string and doesn't have a mandated encoding. In most implementations it's either UTF-16 (Windows) or UTF-32 (*nix), so you may use it, but it's unreliable and depends on the source code encoding. The better way is to use std::u16string (std::basic_string<char16_t>) and std::u32string (std::basic_string<char32_t>), which work regardless of the system and the encoding of the source file:

std::wstring wstr     = L" ██████";
std::u16string u16str = u" ██████";
std::u32string u32str = U" ██████";
std::cout << wstr.size();   // may work, returns the number of wchar_t code units
std::cout << u16str.size(); // always returns the number of UTF-16 code units
std::cout << u32str.size(); // always returns the number of UTF-32 code units

In case you're interested in how to do that for arbitrary Unicode strings, continue reading below.

The “café” issue mentioned above raises the question of how you count the characters in the Tweet string “café”. To the human eye the length is clearly four characters. Depending on how the data is represented this could be either five or six UTF-8 bytes. Twitter does not want to penalize a user for the fact we use UTF-8 or for the fact that the API client in question used the longer representation. Therefore, Twitter does count “café” as four characters, no matter which representation is sent.

[...]

Twitter counts the length of a Tweet using the Normalization Form C (NFC) version of the text. This type of normalization favors the use of a fully combined character (0xC3 0xA9 from the café example) over the long-form version (0x65 0xCC 0x81). Twitter also counts the number of codepoints in the text rather than UTF-8 bytes. The 0xC3 0xA9 from the café example is one codepoint (U+00E9) that is encoded as two bytes in UTF-8, whereas 0x65 0xCC 0x81 is two codepoints encoded as three bytes

Twitter - Counting characters
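And if what you actually want is the number of user-perceived characters (so that Å and emoji ZWJ sequences each count as one, as discussed in the comments above), ICU can also split a string into grapheme clusters with a character BreakIterator. A minimal sketch, again assuming ICU is available (the graphemeCount helper is just for illustration):

#include <unicode/brkiter.h>
#include <unicode/locid.h>
#include <unicode/unistr.h>
#include <iostream>
#include <memory>

// Count extended grapheme clusters (user-perceived characters)
std::size_t graphemeCount(const icu::UnicodeString& text) {
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::BreakIterator> it(
        icu::BreakIterator::createCharacterInstance(icu::Locale::getRoot(), status));
    if (U_FAILURE(status)) return 0;

    it->setText(text);
    std::size_t count = 0;
    for (it->first(); it->next() != icu::BreakIterator::DONE; )
        ++count;
    return count;
}

int main() {
    // "A" + U+030A COMBINING RING ABOVE: 2 code points, but 1 visible character
    std::cout << graphemeCount(icu::UnicodeString::fromUTF8("A\xCC\x8A")) << '\n';      // 1
    std::cout << graphemeCount(icu::UnicodeString::fromUTF8(" \xE2\x96\x88")) << '\n';  // 2: space + █
}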


phuclv
  • Did you make a mistake with "In most implementations it's often either UTF-16 (Windows) or UTF-16 (*nix)"? std::wstring and wchar_t are 32-bit in gcc, and maybe in clang, by default, at least on *nix. – Кое Кто Jun 20 '22 at 11:11
3

std::string only contains 1-byte chars (usually 8 bits, here each holding one UTF-8 code unit), so size() counts bytes; you need wchar_t and std::wstring to achieve what you want:

std::wstring str = L" ██████";
std::cout << str.size();

This prints 7 (one space and 6 Unicode characters). Notice the L before the string literal, so it will be interpreted as a wide string.

Lukas-T
  • technically `char` isn't required to be 8-bit long. Some platforms may have longer `char` – phuclv Oct 19 '19 at 15:48
  • @churill `std::string` knows nothing about UTF-8, and won't "usually" hold UTF-8 unless you actually make it hold UTF-8. Whether UTF-8 is used really depends on the platform being used, the compiler settings, the encoding of the source file, etc. At the very least, if you want a literal in UTF-8, use the `u8` prefix: `string str = u8" ██████";` but the other factors I mentioned still play into it. – Remy Lebeau Oct 19 '19 at 17:30
  • @RemyLebeau I see, I'm bad at finding the right terminology, feel free to edit it :) – Lukas-T Oct 19 '19 at 17:57
  • as commented above, this returns the **number of UTF-16** (or UTF-32 depending on implementation) **code units** and not the *number of characters in the string* (if that's what the OP wants, since currently it's still unclear), so it definitely won't work for strings like `"‍Å"` (it'll return 9 for UTF-16 and 6 for UTF-32) – phuclv Oct 20 '19 at 12:48