72

In C++11 basic_string::c_str is defined to be exactly the same as basic_string::data, which is in turn defined to be exactly the same as *(begin() + n) and *(&*begin() + n) (when 0 <= n < size()).

I cannot find anything that requires the string to always have a null character at its end.

Does this mean that c_str() is no longer guaranteed to produce a null-terminated string?

BЈовић
Mankarse

4 Answers

80

Strings are now required to use null-terminated buffers internally. Look at the definition of operator[] (21.4.5):

Requires: pos <= size().

Returns: *(begin() + pos) if pos < size(), otherwise a reference to an object of type T with value charT(); the referenced value shall not be modified.

Looking back at c_str (21.4.7.1/1), we see that it is defined in terms of operator[]:

Returns: A pointer p such that p + i == &operator[](i) for each i in [0,size()].

And both c_str and data are required to be O(1), so the implementation is effectively forced to use null-terminated buffers.

Additionally, as David Rodríguez - dribeas points out in the comments, the return value requirement also means that you can use &operator[](0) as a synonym for c_str(), so the terminating null character must lie in the same buffer (since *(p + size()) must be equal to charT()); this also means that even if the terminator is initialised lazily, it's not possible to observe the buffer in the intermediate state.

Community
Mikhail Glushenkov
  • 6
    That doesn't say anything about the string being null-terminated. – jalf Sep 26 '11 at 11:10
  • 22
    While that does not say that the string must be null terminated, it can be inferred from the string requirements. Both `c_str` and `data` must be O(1) operations, which means that they cannot create a copy on the fly. Additionally, the requirement of matching `operator[]` output means that either it is already nul terminated, or the call to `data`/`c_str` must add the nul terminator prior to returning the pointer. Additionally, the string must have space for that terminator *before* the call to maintain the O(1) requirement. Technically the string need not be nul terminated, but `data()` does – David Rodríguez - dribeas Sep 26 '11 at 11:15
  • @R.MartinhoFernandes: there is no requirement if pos > size, because that would violate the precondition. – Mankarse Sep 26 '11 at 11:15
  • This is the correct answer. I was unclear in my question. `c_str()` actually returns something slightly different from what I stated. – Mankarse Sep 26 '11 at 11:16
  • 2
    Since `c_str` and `data` are both required to be constant time, IMO this pretty much forces the implementation to use null-terminated buffers. – Mikhail Glushenkov Sep 26 '11 at 11:18
  • 7
    Also, the last quote: *Returns: A pointer p such that p + i == &operator[](i) for each i in [0,size()].* Means that `&operator[](size()) == &operator[](size()-1) + 1` --i.e. if `operator[](size())` returned a reference to a `\0` outside of the string, this requirement could never be met. – David Rodríguez - dribeas Sep 26 '11 at 11:30
  • 9
    @jalf: *That doesn't say anything about the string being null-terminated.* Yes, it does. 21.4.7.1 says that the pointer returned by `c_str()` must point to a buffer of length `size()+1`. 21.4.5 says that the last element of this buffer must have a value of `charT()` -- in other words, the null character. – David Hammen Sep 26 '11 at 12:42
  • 2
    @David and others: The (first) snippet Mikhail posted says nothing about nulls, and nothing about the buffer itself being null-terminated. My point is simply that he said that strings are required to use null-terminated buffers internally, and then posted a quote from the standard talking about something completely different. Even with the second snippet, it doesn't say anything about the buffer itself being null-terminated. – jalf Sep 26 '11 at 12:45
  • 2
    Given that the OP's question is basically "where is the requirement for strings to be null-terminated", I would expect an answer to point to the part of the standard which at least mentions a null. Where in this answer can I see that the result of `operator[]` (whose output you've noted that `c_str` is required to match) must return a null at the end of the string? This answer only gives us half of the inference chain. It tells us that `c_str` is required to return the same thing as something else, which isn't defined in the answer. – jalf Sep 26 '11 at 12:53
  • 1
    @jalf: I don't know what the post looked like when you made that first comment. The post as it stands certainly does answer the question. The standard most certainly does say, in standardese, that `c_str()` must return a pointer to a null-terminated buffer. A non-binding explanatory note that this is the case would have been helpful. Then again, lots of other non-binding explanatory notes elsewhere would also be helpful to those of us who don't speak standardese as a primary language. – David Hammen Sep 26 '11 at 12:56
  • 1
    @DavidHammen again, where, in this post, can I see that the buffer is required to be null-terminated? That's pretty essential information when the answer given is "because it returns the same thing as the buffer". That's not just an explanatory note, it's the entire premise for the answer being correct. – jalf Sep 26 '11 at 13:05
  • 4
    @jalf: "*This answer only gives us half of the inference chain.*" It gives two thirds of the full chain. The one thing that is missing is that the value assigned by default initialization `charT()` is the null character. This is clearly the case when `charT` is `char`. The standard is a bit vague (more than a bit vague) on the meaning of `wchar_t`. – David Hammen Sep 26 '11 at 13:08
  • See also http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2647.html : "This change effectively requires null-terminated buffers." – Mikhail Glushenkov Sep 26 '11 at 13:59
  • 1
    Even the requirement for O(1) does not rule out the possibility of the `charT()` terminator being lazily initialised when `c_str()` is called. The string knows its length, and can make sure that it always has some spare space in which to place the terminator. This means that the buffer does not necessarily always have to be null terminated. – Mankarse Sep 28 '11 at 00:41
  • 1
    @Mankarse Yes, theoretically, but you can't observe the string in the intermediate state. – Mikhail Glushenkov Sep 28 '11 at 10:46
  • @MikhailGlushenkov - It could be observed by reading off the end of the buffer using *(&front() + size). I'm pretty sure that would invoke undefined behaviour though. – Mankarse Sep 28 '11 at 13:39
  • 2
    @Mankarse 21.4.5 says that `front()` is equivalent to `operator[](0)`, so your example still returns null (since `&operator[](0)` is equivalent to `c_str()`). If you use `begin()` instead of `front()`, you'll be effectively dereferencing `end()`, which is undefined. – Mikhail Glushenkov Sep 28 '11 at 17:37
  • 2
    This argument hinges on O(1) time, which doesn't mean that `c_str()` and `data` can't do actual processing. Indeed, one could envision a buffering technique where a fixed number of characters at the end are stored in a separate buffer and copied when `c_str()` is called (since copying a fixed amount is technically still O(1)). Also, perhaps the string is using some kind of relocatable OS memory, and calling `c_str()` needs to fix the memory in place (to prevent moving). So while it must somehow be null terminated internally, I don't agree that the address of `front` must be a synonym for `c_str`. – edA-qa mort-ora-y Oct 20 '11 at 14:22
  • @edA-qa mort-ora-y See 21.4.1/5 - "The char-like objects in a basic_string object shall be stored contiguously." – Mikhail Glushenkov Oct 20 '11 at 22:24
  • 2
    @edA-qa mort-ora-y On the question of `&front()` being equivalent to `c_str()` - 21.4.5/9 defines `front()` as being equivalent to `operator[](0)`, and 21.4.7.1/1 says that `c_str()` is the same as `&operator[](0)`. See my reply to Mankarse. – Mikhail Glushenkov Oct 20 '11 at 22:32
  • @MikhailGlushenkov, 21.4.7.1 does indeed say the pointer values are equivalent, not just the contents. Thank you. – edA-qa mort-ora-y Oct 21 '11 at 06:34
  • If the null terminator is guaranteed to be there, it would be nice if it were also defined behaviour to read it. Then dereferencing *s.end() could be defined to give a null character. – CashCow Oct 24 '12 at 11:05
23

Well, in fact it is true that the new standard stipulates that .data() and .c_str() are now synonyms. However, it doesn't say that .c_str() is no longer zero-terminated :)

It just means that you can now rely on .data() being zero-terminated as well.

Paper N2668 defines c_str() and data() members of std::basic_string as follows:

 const charT* c_str() const; 
 const charT* data() const; 

Returns: A pointer to the initial element of an array of length size() + 1 whose first size() elements equal the corresponding elements of the string controlled by *this and whose last element is a null character specified by charT().

Requires: The program shall not alter any of the values stored in the character array.

Note that this does NOT mean that any valid std::string can be treated as a C-string because std::string can contain embedded nulls, which will prematurely end the C-string when used directly as a const char*.

Addendum:

I don't have access to the actual published final spec of C++11 but it appears that indeed the wording was dropped somewhere in the revision history of the spec: e.g. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2011/n3242.pdf

§ 21.4.7 basic_string string operations [string.ops]

§ 21.4.7.1 basic_string accessors [string.accessors]

     const charT* c_str() const noexcept;
     const charT* data() const noexcept;
  1. Returns: A pointer p such that p + i == &operator[](i) for each i in [0,size()].
  2. Complexity: constant time.
  3. Requires: The program shall not alter any of the values stored in the character array.
sehe
  • @R.MartinhoFernandes: my edit and your comment must have crossed posts? – sehe Sep 26 '11 at 11:09
  • 1
    Yeah, sorry about that. Regarding your edit I'd like to note that the FDIS wording is *very* different from this and the requirement for null-termination is not this obvious, but it's ninja'ed in :) – R. Martinho Fernandes Sep 26 '11 at 11:18
  • dug up some more revisions. Now, who buys me that copy of the spec ;) – sehe Sep 26 '11 at 11:34
  • Please escape the Square brackets that appear as part of `Operator[](i)` in your post, since they are currently interpreted as a link, which makes the text impossible to understand. – Kevin Cathcart Sep 26 '11 at 16:10
  • @Kevin: sry about that, fixed – sehe Sep 26 '11 at 16:15
10

The "history" was that a long time ago when everyone worked in single threads, or at least the threads were workers with their own data, they designed a string class for C++ which made string handling easier than it had been before, and they overloaded operator+ to concatenate strings.

The issue was that users would do something like:

s = s1 + s2 + s3 + s4;

and each concatenation would create a temporary, each of which had to be a fully formed string.

Therefore someone had the brainwave of "lazy evaluation" such that internally you could store some kind of "rope" with all the strings until someone wanted to read it as a C-string at which point you would change the internal representation to a contiguous buffer.

This solved the problem above but caused a load of other headaches, particularly in the multi-threaded world, where one expects a .c_str() call to be read-only (it changes nothing, so nothing needs locking). Premature internal locking in the class implementation, just in case someone was using it from multiple threads (when there wasn't even a threading standard), was also not a good idea. In fact, doing any of this was more costly than simply copying the buffer each time. "Copy on write" was abandoned for string implementations for the same reason.

Thus making .c_str() a truly immutable operation turned out to be the most sensible thing to do. But could one "rely" on that in a standard that is now thread-aware? The new standard therefore decided to state clearly that you can, and thus the internal representation needs to hold the null terminator.

Mankarse
CashCow
  • The old `string` also had the strange property that the first non const `begin()` would invalidate iterators! – curiousguy Sep 10 '15 at 22:27
2

Well spotted. This is certainly a defect in the recently adopted standard; I'm sure that there was no intent to break all of the code currently using c_str. I would suggest a defect report, or at least asking the question in comp.std.c++ (which will usually end up before the committee if it concerns a defect).

James Kanze
  • no need... http://groups.google.com/group/comp.std.c++/browse_thread/thread/329a41d93cf9c2e8 – sehe Sep 26 '11 at 11:08
  • Well, there are bits in the FDIS that are arguably shaky. `21.4.2/2` says that `.data()` for an empty string isn't actually null-terminated (`.data()+1` is not valid, but should be a pointer one beyond the `\0`) – MSalters Sep 26 '11 at 12:31