C++ UTF-8 actual string length

Question

Is there any native (cross-platform) C++ function in any of the standard libraries that returns the actual length of a std::string?

Update: as we know, std::string::length() returns the number of bytes, not the number of characters. I already have a custom function that returns the actual count, but I'm looking for a standard one.

You may find interesting answers [here](http://stackoverflow.com/questions/4063146/getting-the-actual-length-of-a-utf-8-encoded-stdstring). Note, however, that as @BenVoigt pointed out, C++11 now has standard ways to achieve it. — syam, May 31 '13 at 19:02
@syam: Agreed, but the answer has changed in the last 3 years. Also, this question specifically asks for functions provided by the Standard, not a custom implementation, which is all you find in the answers to the other question. — Ben Voigt, May 31 '13 at 19:03
@BenVoigt: I just saw your answer, didn't know this was part of C++11. Still, the answers to that other question may still be interesting, I'll just reword my comment. — syam, May 31 '13 at 19:05
@BenVoigt Has the answer changed that much in the last three years? Compilers are now _required_ to provide the Unicode facets, but the facet interface is unchanged (even to the point of still ignoring `std::string` and using `charT*`). — James Kanze, May 31 '13 at 19:28
"character" is a bit ambiguous. Depending on what you really want, the answer will become more complex. Do you want a count of Unicode code points? A count of grapheme clusters ("visible" glyphs which include combining characters, that we'd consider a single character when reading on screen)? What about invisible code points like the zero-width space? — Cory Nelson, May 31 '13 at 20:14
@James: I can't find `codecvt_utf8` or `wstring_convert` anywhere in ISO/IEC 14882:2003(E). It's newly standardized; even if the foundations were laid before, the actual functionality wasn't portable (unlike, say, parts of the Standard which are called out as optional but have well-defined behavior when they exist). — Ben Voigt, May 31 '13 at 20:46
@BenVoigt The requirement to support UTF8 is new to C++11. But the basic `codecvt` facet isn't, and it hasn't been significantly changed:-(. — James Kanze, Jun 01 '13 at 14:45
@James: Yes, the foundation was there, but there was no portable way to access it even on systems which did provide a subclass for UTF-8. — Ben Voigt, Jun 01 '13 at 14:47
@BenVoigt Yes. My point is more that the interface is still the old, awkward C++03 interface, with `charT*`, no iterators and no support for `std::string`. — James Kanze, Jun 01 '13 at 15:02
@James: Ahh, now I get it. You're not saying that C++03 was mostly there, but that C++11 support is still archaic. Yes, I agree, this interface is very unfriendly. — Ben Voigt, Jun 01 '13 at 15:29

Ben Voigt · Answer 1 · 2013-05-31T19:15:19.650

6

`codecvt` ought to be helpful. The Standard provides implementations for UTF-8; for example, `std::codecvt_utf8<char32_t>` would be appropriate in this case.

Probably something like:

std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t>().from_bytes(the_std_string).size()
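
For reference, the one-liner above could be wrapped in a small function along these lines. This is only a sketch: the name `utf8_code_points` is made up for illustration, the result is a count of Unicode code points (not user-perceived characters), and `std::wstring_convert` and `std::codecvt_utf8` have been deprecated since C++17, so newer compilers may warn about them.

```cpp
#include <codecvt>   // std::codecvt_utf8 (deprecated since C++17)
#include <cstddef>
#include <locale>    // std::wstring_convert (deprecated since C++17)
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 encoded string
// by converting it to UTF-32 and taking the size of the result.
// from_bytes() throws std::range_error if the input is not valid UTF-8.
std::size_t utf8_code_points(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    return converter.from_bytes(utf8).size();
}
```

Note that the conversion allocates a temporary UTF-32 string just to measure it, which may matter for long inputs.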

edited May 31 '13 at 19:15

answered May 31 '13 at 19:02

Ben Voigt


score 1 · Answer 2 · answered Jun 01 '13 at 23:48

1

The actual length is the number of bytes. Counting code points has very little meaning. You may, however, want to count other things, such as grapheme clusters.

See more about the different kinds of string length at http://utf8everywhere.org


Pavel Radzivilovsky


score 1 · Answer 3 · answered Jun 03 '13 at 17:33

There is no way to do that in C/C++ without third-party libraries. Even if you convert to `char32_t`, you will get code points, not characters.

A code point does not match the user's perception of a character because of things like decomposed forms, ligatures, and variation selectors.
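
To illustrate the decomposed-form case, here is a minimal sketch. The helper `count_code_points` and the two sample strings are made up for this example, and the helper assumes its input is already valid UTF-8. Both strings display as a single "é", yet they differ in byte length and in code point count.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points by counting every byte that is
// not a UTF-8 continuation byte (bit pattern 10xxxxxx).
// Assumes valid UTF-8; no validation is performed.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// "é" as a single precomposed code point, U+00E9 (2 bytes in UTF-8).
const std::string precomposed = "\xC3\xA9";

// "é" as "e" followed by U+0301 COMBINING ACUTE ACCENT (1 + 2 bytes).
const std::string decomposed = "e\xCC\x81";
```

Here `count_code_points(precomposed)` is 1 and `count_code_points(decomposed)` is 2, although a reader sees one character in both cases. Neither the byte count nor the code point count gives the number of user-perceived characters.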

The closest available construct to a "user character" is a "grapheme cluster" (see http://www.unicode.org/reports/tr29/)

Your best cross-platform option is ICU4C (http://site.icu-project.org/)
