2

How do I get the correct length of a std::u8string in C++20? I tried the following code, but it prints a value that is not the number of characters; it may be counting something else, such as code points.

How can I get the value I expect, 7, which is the number of characters?

#include <iostream>
#include <string>

int main() {
    const char8_t* s = u8"Hello😃😃";
    auto st = std::u8string(s);
    std::cout << st.size() << std::endl;
}
KiYugadgeter
  • I think both `size()` and `length()` for a `std::u8string` will return the number of *code points* in the string, rather than the number of printed characters. You could, perhaps, try converting to a `std::u32string` to be sure all multi-byte codes are counted only as single characters. – Adrian Mole Jan 11 '20 at 05:05
  • What do you plan to use this number for, once obtained? Be aware that, in the presence of combining diacritics and ligatures, the number of codepoints may not correspond to the number of graphemes (display units that a human would think of as a "character"). – Igor Tandetnik Jan 11 '20 at 05:12
  • I want a number of display units. – KiYugadgeter Jan 11 '20 at 05:16
  • @KiYugadgeter: Note that the number of Unicode codepoints is *not* equal to the "display units" of a string. That requires complex text layout, which is an even more complicated computation. – Nicol Bolas Jan 11 '20 at 05:41
  • @AdrianMole "*I think both `size()` and `length()` for a `std::u8string` will return the number of "code points" in the string*" - no, they return the number of encoded "code units". "*You could, perhaps, try converting to a `std::u32string` to be sure all multi-byte codes are counted only as single "characters".*" - what you refer to as "characters" are, in fact, "codepoints". And what you see visually is groups of codepoints, known as "grapheme clusters". See [What's the difference between a character, a code point, a glyph and a grapheme?](https://stackoverflow.com/questions/27331819/) – Remy Lebeau Jan 11 '20 at 07:13
  • @RemyLebeau Thanks for the clarification - I wasn't sure exactly what the correct terminologies were. Now I know. – Adrian Mole Jan 11 '20 at 07:16

3 Answers

7

A u8string is effectively a sequence of bytes as far as most C++ functions are concerned. As such, size() gives you 13 (48 65 6c 6c 6f f0 9f 98 83 f0 9f 98 83). The "😃" ("SMILING FACE WITH OPEN MOUTH", U+1F603) is encoded as the 4 elements f0 9f 98 83. You will see this with [i], substr, etc. as well.

Knowing that it is UTF-8, you can count the number of Unicode code points. You could also convert to a u32string, where each element is one code point. I don't believe C++ has a function to count code points directly on a u8string out of the box:

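#include <cstddef>
#include <string>

// Counts code points, assuming str holds valid UTF-8.
std::size_t count_codepoints(const std::u8string &str)
{
    std::size_t count = 0;
    for (const auto c : str)
        if ((c & 0b1100'0000) != 0b1000'0000) // Not a trailing byte
            ++count;
    return count;
}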

However, this may still not be what people think of as the "number of characters". Multiple code points can be used to represent a single visible character, as with the "combining characters". Some of these also have "precomposed" forms, and the order of the combining code points can vary, which leads to the "normal forms" and to issues with comparing Unicode strings. For example, "Á" might be "LATIN CAPITAL LETTER A WITH ACUTE" (U+00C1), which is UTF-8 C3 81, or it might be a plain "A" followed by "COMBINING ACUTE ACCENT" (U+0301), which is two code points and 3 UTF-8 bytes: 41 CC 81.

unicode.org publishes tables for each Unicode version that let you properly handle and convert combining characters (and things like upper/lower case conversion), but they are pretty extensive and you would need to write some code to handle them. Third-party libraries (I think Linux mostly uses ICU) or OS functions (Windows has a number of APIs) also provide various utilities.

It's worth noting that you can run into these issues in many other languages and environments, not just C++. For example, JavaScript, Java and .NET, along with the Windows C/C++ API (essentially wchar_t on Windows), use UTF-16 strings, which have "surrogate pairs" for some code points, and many functions actually count UTF-16 elements, not code points.
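The same effect can be seen in C++ with std::u16string. The following is a minimal sketch, assuming well-formed UTF-16; the function name `count_code_points_utf16` is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Counts code points in UTF-16 by skipping low (trailing) surrogates,
// which occupy the range 0xDC00..0xDFFF. Assumes well-formed UTF-16.
std::size_t count_code_points_utf16(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t c : s)
        if (c < 0xDC00 || c > 0xDFFF)
            ++n;
    return n;
}
```

For `u"Hello\U0001F603\U0001F603"`, size() is 9 because each U+1F603 is stored as the surrogate pair D83D DE03, while `count_code_points_utf16` gives 7.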

Fire Lancer
  • [cppreference.com string literal](https://en.cppreference.com/w/cpp/language/string_literal) provides a good discussion that you have covered well. – David C. Rankin Jan 11 '20 at 05:37
2

A standard C++ answer is to transform the string from UTF-8 to UTF-32 and then check the size.

Alarmingly, std::wstring_convert is deprecated as of C++17. I have no idea what the replacement will be.

#include <string>
#include <iostream>
#include <cstdlib>
#include <locale>
#include <codecvt>

auto convert(std::u8string input) -> std::u32string
{
    auto first = reinterpret_cast<const char*>(input.data());
    auto last = first + input.size();

    auto result = std::u32string();

    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> ucs4conv;
    try
    {
        result = ucs4conv.from_bytes(first, last);
    }
    catch(const std::range_error& e) {
        last = first + ucs4conv.converted();
        std::clog << "UCS4 failed after consuming " << std::dec << std::distance(first, last) <<" characters:\n";
        result = ucs4conv.from_bytes(first, last);
    }

    return result;
}

int main() {
    const char8_t* s = u8"Hello😃😃";
    auto st = std::u8string(s);
    std::cout << "bytes      : " << st.size() << std::endl;

    auto ws = convert(st);
    std::cout << "wide chars : " << ws.size() << std::endl;
}

expected output:

bytes      : 13
wide chars : 7

https://godbolt.org/z/Z0a6bb

Richard Hodges
1

Other answers have already suggested ways to compute the number of code points if that is really what you need for your use case. I'm adding this answer to make the point that code point length is probably not what you want.

And actually, I'm not going to make the point myself. Instead, I'm just going to provide a link to an excellent blog post that explains the issues so that you can evaluate what information you actually need.

https://hsivonen.fi/string-length

Tom Honermann