In C++11, the characters of a std::string have to be stored contiguously, as § 21.4.1/5 points out:

The char-like objects in a basic_string object shall be stored contiguously. That is, for any basic_string object s, the identity &*(s.begin() + n) == &*s.begin() + n shall hold for all values of n such that 0 <= n < s.size().
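The identity can be checked directly. This is only a minimal sketch: the helper name `is_contiguous` is made up for illustration and is not part of the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: checks the contiguity identity quoted above
// for every n with 0 <= n < s.size().
bool is_contiguous(const std::string& s) {
    for (std::size_t n = 0; n < s.size(); ++n) {
        // Address reached through the iterator vs. plain pointer arithmetic.
        if (&*(s.begin() + n) != &*s.begin() + n) {
            return false;
        }
    }
    return true;  // Trivially true for an empty string.
}
```

On a conforming C++11 implementation, a call such as `is_contiguous(std::string("hello"))` should return true.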

However, here is how § 21.4.7.1 lists the two functions to retrieve a pointer to the underlying storage (emphasis mine):

const charT* c_str() const noexcept;
const charT* data() const noexcept;
1 Returns: A pointer p such that p + i == &operator[](i) for each i in [0,size()].
2 Complexity: constant time.
3 Requires: The program shall not alter any of the values stored in the character array.

One possibility I can think of for point number 3 is that the pointer can become invalidated by the following uses of the object (§ 21.4.1/6):

  • as an argument to any standard library function taking a reference to non-const basic_string as an argument.
  • Calling non-const member functions, except operator[], at, front, back, begin, rbegin, end, and rend.

Even so, iterators can become invalidated too, yet we can still modify the string through them until they do. Likewise, we can still read from the buffer through the pointer until it becomes invalidated.

Why can't we write directly to this buffer? Is it because it would put the class in an inconsistent state, as, for example, end() would not be updated with the new end? If so, why is it permitted to write directly to the buffer of something like std::vector?

Use cases for this include passing the buffer of a std::string to a C interface to retrieve a string, instead of passing in a vector<char> and then initializing the string with iterators from that:

std::string text;
text.resize(GetTextLength());
GetText(text.data());
WhozCraig
chris
  • I honestly don't know, which is why I'm asking, but is hitting up `&text[0]` or `text.begin()` not going to give you what you want in the sample case provided? (or perhaps I've been using them wrong, which wouldn't shock me at this point =P). – WhozCraig Jan 12 '13 at 06:29
  • Nitpicking: a *good* C-Api should take the length also, so it should be `GetText(text.data(), text.size());` :P – Nawaz Jan 12 '13 at 06:34
  • @WhozCraig, Well, #1 in the second box is supposed to say `operator[]`, though it looks pretty bad when it accepts code, so I'll leave it for now. That should rule out `&text[0]`, but it's weird because we can use `operator[]` to modify it anyway, but can't modify it through an equivalent pointer. I'll see what `begin()` has to say. – chris Jan 12 '13 at 06:36
  • @Nawaz, True, and I should `resize` it to `length + 1` as well, but I decided not to edit it just for that. – chris Jan 12 '13 at 06:37
  • @Nawaz an excellent point. – WhozCraig Jan 12 '13 at 06:37
  • @chris #1 may return the pointer, but as `const`, which is what I think the standard is trying to specify as the difference between the two (`&operator[](size_t)` and `data()`, etc). But it is still an interesting question. – WhozCraig Jan 12 '13 at 06:39
  • @WhozCraig, I see your point, and thank you very much for that edit! I had no idea you could do that. – chris Jan 12 '13 at 06:40
  • @chris lol. neither did I until just now, figured it was worth a try, and it looked good in preview =P – WhozCraig Jan 12 '13 at 06:41
  • Perhaps they didn't want to say "The program shall not set any of the elements of the character array to *zero*, thus invalidating the internal state of the object". – s.bandara Jan 12 '13 at 06:49

1 Answer

Why can't we write directly to this buffer?

I'll state the obvious point: because it's const. And casting away const and then modifying that data is... rude.

Now, why is it const? That goes back to the days when copy-on-write was considered a good idea, so std::basic_string had to allow implementations to support it. It would be very useful to get an immutable pointer to the string (for passing to C-APIs, for example) without incurring the overhead of a copy. So c_str needed to return a const pointer.

As for why it's still const? Well... that goes to an oddball thing in the standard: the null terminator.

This is legitimate code:

std::string stupid;
const char *pointless = stupid.c_str();

pointless must be a NUL-terminated string. Specifically, it must be a pointer to a NUL character. So where does the NUL character come from? There are a couple of ways for a std::string implementation to allow this to work:

  1. Use small-string optimization, which is a common technique. In this scheme, every std::string object has an internal buffer it can use for a single NUL character.
  2. Return a pointer to static memory, containing a NUL character. Therefore, every empty std::string object will return the same pointer.
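Whichever scheme an implementation picks, the observable guarantee is the same. A minimal sketch of that guarantee follows; the function name `empty_c_str_points_to_nul` is hypothetical and exists only for this illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: confirms that c_str() on an empty string yields
// a valid pointer to a NUL character, wherever that character lives.
bool empty_c_str_points_to_nul() {
    std::string empty;
    const char* p = empty.c_str();
    // Only read through p; the storage behind it may be genuinely const.
    return p != nullptr && *p == '\0';
}
```

The sketch deliberately never writes through `p`, since under scheme #2 the character it points to could sit in read-only static memory.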

Not everyone should be forced to implement SSO. So the standards committee needed a way to keep #2 on the table. And part of that is giving you a const string from c_str(). And since this memory is likely real const, not fake "Please don't modify this memory const," giving you a mutable pointer to it is a bad idea.

Of course, you can still get such a pointer by doing &str[0], but the standard is very clear that modifying the NUL terminator is a bad idea.

Now, that being said, it is perfectly valid to modify the array of characters through the &str[0] pointer, so long as you stay in the half-open range [0, str.size()). You just can't do it through the pointer returned by data or c_str. Yes, even though the standard in fact requires str.c_str() == &str[0] to be true.

That's standardese for you.
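For the use case in the question, the `&str[0]` route could look roughly like this. It is a sketch under stated assumptions: `GetTextLength` and `GetText` here are stand-ins I made up for the question's C interface (with the length parameter suggested in the comments), and `ReadText` and `kText` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Stand-in data for the hypothetical C interface.
static const char kText[] = "hello";

// Assumed signature: reports the text length, excluding the terminator.
std::size_t GetTextLength() { return sizeof(kText) - 1; }

// Assumed signature: writes at most `size` characters and no terminator.
void GetText(char* buffer, std::size_t size) {
    const std::size_t n = size < GetTextLength() ? size : GetTextLength();
    std::memcpy(buffer, kText, n);
}

std::string ReadText() {
    std::string text;
    text.resize(GetTextLength());
    if (!text.empty()) {
        // &text[0] is a mutable pointer into contiguous storage.
        // The callee only writes within [0, text.size()), so the
        // NUL terminator at text[text.size()] is never touched.
        GetText(&text[0], text.size());
    }
    return text;
}
```

The `empty()` check keeps the call from handing out a pointer to the terminator when there is nothing to write.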

Community
Nicol Bolas
  • Thanks for the answer. On your first point, it's only "rude" if it was const in the first place. Having non-const data, returning a const pointer to it, and casting the pointer to one you can use is ok since the data itself isn't const, which is what I was thinking about. However, having the possibility of premade constant memory being returned would mess that all up. – chris Jan 12 '13 at 07:19
  • I was just about to ask if I could let the API overwrite the null terminator, but then I read your link :p I'm just as happy passing `&str[0]` anyway :) – chris Jan 12 '13 at 07:27
  • @chris: There's a difference between "allowed" and "polite". A `const` object is a contract between you and some other code. By un-`const`ing it, you're breaking that contract. Which may be permitted under certain conditions by the language, but it is rude to whatever code gave you that object and told you not to touch it. If someone tells you not to sit down on their sofa, and you do, they may not throw you out of their house for it. But they're not going to look kindly on it either. – Nicol Bolas Jan 12 '13 at 07:27
  • Well, by "rude", I assumed you meant "undefined behaviour". Guess it goes to show... well, you know the saying. – chris Jan 12 '13 at 07:29
  • Sure you can do it through the pointers returned by `data` and `c_str`, *because of the guarantee you rephrased*. That guarantee gives you the guarantee that the returned pointer points to modifiable memory (for all but the 0-terminator, which need not be set if you didn't use `c_str`) – Deduplicator Sep 13 '14 at 17:11
  • Why is it "*perfectly valid*" to modify the array via the `&str[0]` pointer? It seems to be explicitly forbidden from the `c_str` contract. – tmyklebu Feb 08 '15 at 06:54
  • There are many cases where "copy-on-write" may still be a good idea. An embedded-system implementation might benefit greatly from recognizing that a string is being constructed from text stored in ROM and have the string object identify the text rather than allocate a heap object for it; copying a string or portion thereof whose text is in ROM would not require allocating new heap storage for a copy of the text. Applying copy-on-write to things in RAM as well may be too complicated to justify the text, but if the implementation can use an implementation-specific... – supercat Nov 12 '15 at 17:57
  • ...means of determining whether a string is stored in ROM, applying copy-on-write to those cases may be easier than applying it to RAM strings, but still offer much of the benefit [not just speed: copy on write would make it possible for small embedded systems to use strings in ROM which are larger than the entire RAM of the device!]. – supercat Nov 12 '15 at 18:00