
I'm having some trouble figuring out the exact semantics of std::string.length(). The documentation explicitly points out that length() returns the number of characters in the string and not the number of bytes. I was wondering in which cases this actually makes a difference.

In particular, is this only relevant to non-char instantiations of std::basic_string<> or can I also get into trouble when storing UTF-8 strings with multi-byte characters? Does the standard allow for length() to be UTF8-aware?

Lightness Races in Orbit
ComicSansMS
  • There is wstring for UTF, and there it makes sense that length returns the number of characters, since the character's size could vary. – AndersK Oct 12 '11 at 16:33
  • @AndersK.: No, `wchar_t` has a fixed size like any other type. It can't magically vary. – Lightness Races in Orbit Oct 12 '11 at 16:33
  • Also check this lovely thread about `std::string` vs. `std::wstring` and some stuff about Unicode: http://stackoverflow.com/questions/402283/stdwstring-vs-stdstring – wkl Oct 12 '11 at 16:35
  • @AndersK.: `wstring` has nothing to do with UTF. Perhaps you were thinking of `u16string` or `u32string`? – Kerrek SB Oct 12 '11 at 17:27

4 Answers


When dealing with non-char instantiations of std::basic_string<>, sure, length may not equal number of bytes. This is particularly evident with std::wstring:

std::wstring ws = L"hi";
std::cout << ws.length();     // <-- 2 (elements), not 2 * sizeof(wchar_t) (bytes)

But std::string is about char characters; there is no such thing as a multi-byte character as far as std::string is concerned, whether you crammed one in at a high level or not. So, std::string.length() is always the number of bytes represented by the string. Note that if you're cramming multibyte "characters" into an std::string, then your definition of "character" suddenly becomes at odds with that of the container and of the standard.
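To make the distinction concrete, here is a minimal sketch. `byte_count` and `utf8_e_acute` are hypothetical helpers made up for this illustration, not standard library functions; the point is only that `length()` counts elements of the string's character type, whatever you stored in them.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the standard library): the number of bytes
// occupied by the elements of any std::basic_string instantiation.
template <typename CharT>
std::size_t byte_count(const std::basic_string<CharT>& s) {
    return s.length() * sizeof(CharT);
}

// "é" spelled out as its two UTF-8 bytes, so the example does not depend on
// the encoding of the source file.
inline std::string utf8_e_acute() {
    return "\xC3\xA9";
}
```

With these, `std::wstring(L"hi").length()` is 2 while `byte_count(std::wstring(L"hi"))` is `2 * sizeof(wchar_t)` (typically 4 on Windows and 8 on Linux). For the UTF-8 case, `utf8_e_acute().length()` is 2 even though a reader sees a single character.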

Lightness Races in Orbit
  • That makes perfect sense. I simply got confused by the wording in the documentation here. Thanks for clearing things up. – ComicSansMS Oct 12 '11 at 16:40
  • @ComicSansMS: Not a problem :) – Lightness Races in Orbit Oct 12 '11 at 16:41
  • *But `std::string` is about `char` characters*, so the definition of "character" in C++ is "element of some string type", rather than "what a human sees, encoded" or "a unicode codepoint, encoded somehow". This sounds believable, but can anyone quote chapter-and-verse on this? – Adrian Ratnapala Jan 08 '12 at 10:14
  • @AdrianRatnapala: It's less that the standard says it doesn't care about encodings, and more about it not saying that it does. Still, `2.3/1` might be of interest - it defines the "basic character set". And `2.3/3` says: `The execution character set and the execution wide-character set are implementation-defined supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets and the sets of additional members are locale-specific.` – Lightness Races in Orbit Jan 08 '12 at 17:18
  • Well, I guess that's what I get for asking for chapter-and-verse. – Adrian Ratnapala Jan 08 '12 at 18:02
  • @AdrianRatnapala: Yes, when asking for chapter-and-verse, you get chapter-and-verse. Anything else I can help you with? :) – Lightness Races in Orbit Jan 08 '12 at 18:24

If we are talking specifically about std::string, then length() does return the number of bytes.

This is because a std::string is a basic_string of chars, and the C++ Standard defines the size of one char to be exactly one byte.

Note that the Standard doesn't fix how many bits are in a byte (it only requires at least 8), but that's another story entirely and you probably don't care.

EDIT: The Standard does say that an implementation shall provide a definition for CHAR_BIT which says how many bits are in a byte.
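As a rough sketch of what those guarantees look like in code (the helper name `bit_count` is made up for this example):

```cpp
#include <climits>   // CHAR_BIT
#include <cstddef>
#include <string>

// sizeof(char) is 1 by definition, so one char is always one byte.
static_assert(sizeof(char) == 1, "guaranteed by the C++ standard");

// A byte has at least 8 bits; the exact number is implementation-defined
// and reported by CHAR_BIT.
static_assert(CHAR_BIT >= 8, "guaranteed by the C++ standard");

// Hypothetical helper: number of bits occupied by the elements of a std::string.
inline std::size_t bit_count(const std::string& s) {
    return s.length() * CHAR_BIT;
}
```

On the usual platforms `CHAR_BIT` is 8, so `bit_count("abc")` would be 24, but the code only relies on what the Standard promises.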

By the way, if you go down a road where you do care how many bits are in a byte, you might consider reading this.

John Dibling

A std::string is std::basic_string<char>, so s.length() * sizeof(char) is the length in bytes (and sizeof(char) is 1 by definition). Also, std::string knows nothing of UTF-8, so you're going to get the byte size even if that's not really what you're after.

If you have UTF-8 data in a std::string, you'll need to use something else such as ICU to get the "real" length.
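If all you need is the number of Unicode code points and you can assume the string holds well-formed UTF-8, a small hand-rolled counter is enough. This is only a sketch: `utf8_code_points` is a hypothetical helper, it does no validation, and it counts code points rather than user-perceived characters (combining sequences still count as several). For anything beyond that, a library such as ICU is the safer route.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a std::string that is
// assumed to hold well-formed UTF-8. Continuation bytes have the bit pattern
// 10xxxxxx, so every byte that does not match that pattern starts a new
// code point.
inline std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For example, for `std::string s = "na\xC3\xAF" "ve";` ("naïve" in UTF-8), `s.length()` is 6 while `utf8_code_points(s)` is 5.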

NuSkooler

cplusplus.com is not "the documentation" for std::string; it's a poor quality site full of poor quality information. The C++ standard defines it very clearly:

  • 21.1 [strings.general] ¶1

    This Clause describes components for manipulating sequences of any non-array POD (3.9) type. In this Clause such types are called char-like types, and objects of char-like types are called char-like objects or simply characters.

  • 21.4.4 [string.capacity] ¶1

    size_type size() const noexcept;
    Returns: A count of the number of char-like objects currently in the string.
    Complexity: constant time.

    size_type length() const noexcept;
    Returns: size()
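The same wording applies to every instantiation, which a short sketch can illustrate. `char_like_count` is a hypothetical wrapper written for this example; it just forwards to `size()`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: the count of char-like objects, as the quoted wording
// describes it, for any instantiation of std::basic_string.
template <typename CharT>
std::size_t char_like_count(const std::basic_string<CharT>& s) {
    return s.size();  // length() returns the same value
}
```

For instance, U+1F600 stored in a `std::u16string` (`u"\U0001F600"`) gives a count of 2, because it takes a surrogate pair of two `char16_t` objects, while the same code point in a `std::u32string` (`U"\U0001F600"`) gives 1. Neither container knows or cares that both hold one code point.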

Jonathan Wakely