41

In C++11, we know that std::string is guaranteed to be both contiguous and null-terminated (or more pedantically, terminated by charT(), which in the case of char is the null character 0).

There is this C API I need to use that fills in a string by pointer. It writes the whole string + null terminator. In C++03, I was always forced to use a vector<char>, because I couldn't assume that string was contiguous or null-terminated. But in C++11 (assuming a properly conforming basic_string class, which is still iffy in some standard libraries), I can.

Or can I? When I do this:

std::string str(length, '\0');

The string will allocate length+1 bytes, with the last filled in by the null-terminator. That's good. But when I pass this off to the C API, it's going to write length+1 characters. It's going to overwrite the null-terminator.

Admittedly, it's going to overwrite the null-terminator with a null character. Odds are good that this will work (indeed, I can't imagine how it couldn't work).

But I don't care about what "works". I want to know, according to the spec, whether it's OK to overwrite the null-terminator with a null character?

Xeo
Nicol Bolas
  • Does a null character not equal a null terminator? Wouldn't they both be '\0' or ASCII value 0? – Sean Dawson Oct 05 '12 at 06:05
  • 1
    @Nox: A null terminator is what you call the null *character* that goes at the end of the string, to signal that it is the end of the string. – Nicol Bolas Oct 05 '12 at 06:08
  • I don't see why it would be a bad thing. As long as it is a null character so that C can see it is the end of a string, it shouldn't cause problems. – Sean Dawson Oct 05 '12 at 06:11
  • 5
    Right, but @NicolBolas's question is not "does it cause a problem", but "does the spec allow for it". – nneonneo Oct 05 '12 at 06:22
  • Why not allowed? You get the pointer to the null terminator and just change it. I don't see any problem with that. – SwiftMango Oct 05 '12 at 07:01
  • 2
    @texasbruce In other words, who cares what the spec allows, if it works on your system use it? Luckily, not everyone has that attitude. –  Oct 05 '12 at 07:02
  • 3
    It should be possible to avoid the problem by using a `std::string` with an extra null character at the end, with length `length + 1`, unless I'm missing something? –  Oct 05 '12 at 07:06
  • C family can do sorta direct memory read and write by using pointer. That's why it is medium level opposed to high level. Any C background programmer will tell you this overwrite is perfectly normal. – SwiftMango Oct 05 '12 at 07:09
  • 7
    @texasbruce That is utterly irrelevant. The point is that nothing in the standard guarantees that the null termination is at a writeable memory location at all. It’s entirely possible (if unlikely) that it’s in read-only memory, for instance. Then any attempt to write to it will crash the program. Any competent C programmer will tell you that you are stark raving mad if you attempt to write portable programs that ignore these effects. It is *not* “perfectly normal” at all. – Konrad Rudolph Oct 05 '12 at 08:31
  • @KonradRudolph: It's perfectly possible that not being portable to all platforms is a conscious design decision, in which case you can make more assumptions, so you don't need to bother with what the spec says because all that matters is what your compiler and your libraries actually implement. – Frerich Raabe Oct 05 '12 at 08:40
  • 4
    @FrerichRaabe Agreed. But that’s a completely different discussion. And even then it doesn’t pay to ignore what the spec says: you may still consciously decide to *break* the spec – but you should know it first. – Konrad Rudolph Oct 05 '12 at 08:44
  • Sure thing. Let the standard rule everything. EOC. – SwiftMango Oct 05 '12 at 13:21
  • This is subject of https://groups.google.com/forum/#!msg/comp.lang.c++.moderated/ynde19RQVIw/NxlBKXQ419IJ – Johannes Schaub - litb Oct 06 '12 at 12:52
  • "In C++11, we know that std::string is guaranteed to be both contiguous and null-terminated" Where does the standard guarantee that the `std::string` is null-terminated? The standard does say it's null-terminated if you call `.c_str()` or `.data()`, but where is it stated that it's guaranteed to be null-terminated always? – Aykhan Hagverdili Dec 07 '19 at 19:13
  • @Ayxan: It's inferred; essentially, the standard defines things in a way that it's impossible to implement `string` such that the string *isn't* NUL-terminated. `data` returns a pointer such that `p + i` is equal to what you get from `&operator[](i)` within a *closed* range `[0, size()]`. That pointer therefore must itself be NUL-terminated. Further, `data` may not invalidate pointers/iterators to the character sequence, so it cannot allocate a NUL-terminated string when it gets called. The only valid implementation left is to always NUL-terminate the string. – Nicol Bolas Dec 07 '19 at 20:26
  • @NicolBolas an implementation could always have room for one more character in the buffer than the size of the string and add NUL when `.data` or `.c_str` is called. So it's technically possible for a conforming implementation of `std::string` to not be NUL-terminated all the time. – Aykhan Hagverdili Dec 07 '19 at 20:39
  • @Ayxan: No it couldn't, because `*(&operator[](size() - 1) + 1)` is required to work (obviously if `size() > 0`) and be the NUL-character. Also, what exactly would be the point of such an implementation? Also, `c_str` is `const`, and in the standard library, calling a `const` function on the same object from two different threads has to be thread-safe. So if a `const` function mutates the stored data, it would need to be done behind a mutex or other synchronization primitive, or otherwise guarantee that the two writes won't interfere with one-another. – Nicol Bolas Dec 07 '19 at 20:43
  • @NicolBolas I didn't know `*(&operator[](size() - 1) + 1)` was guaranteed to be NUL. Thank you for the clarification. – Aykhan Hagverdili Dec 08 '19 at 07:10

4 Answers

25

Unfortunately, this is UB, if I interpret the wording correctly (in any case, it's not allowed):

§21.4.5 [string.access] p2

Returns: *(begin() + pos) if pos < size(), otherwise a reference to an object of type T with value charT(); the referenced value shall not be modified.

(Editorial error that it says T not charT.)

.data() and .c_str() basically point back to operator[] (§21.4.7.1 [string.accessors] p1):

Returns: A pointer p such that p + i == &operator[](i) for each i in [0,size()].
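The address relationship in that Returns clause can be checked directly. The following is only a minimal sketch: the helper name data_matches_subscript is hypothetical, and it reads (never writes) the element at index size(), which the C++11 wording quoted above permits.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: checks that data() + i == &operator[](i) for
// every i in the closed range [0, size()], and that the element at
// index size() reads as the null character.
bool data_matches_subscript(const std::string& s) {
    const char* p = s.data();
    for (std::size_t i = 0; i <= s.size(); ++i) {
        if (p + i != &s[i]) {
            return false;
        }
    }
    // Reading the terminator is fine; only modifying it is in question.
    return p[s.size()] == '\0';
}
```

A conforming C++11 library should return true for any string, including an empty one. Note that this says nothing about whether the terminator may be written to.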

Xeo
  • 8
    Does writing '\0' into something which is already '\0' actually count as modifying it? – Michael Anderson Oct 05 '12 at 06:57
  • 1
    I’m not sure what to make of this: since `data()` returns a pointer, `str.data()[str.size()]` is clearly **not** just a reference, it’s a bona fide allocated `CharT` object, and as such can safely be written to. But unfortunately `data()` returns a constant string so it cannot be modified anyway. Getting a pointer to a modifiable string via `&str[0]` yields a pointer to a modifiable contiguous buffer but is there a guarantee that `*(&str[0] + str.size())` is writeable? Not as far as I can see. – Konrad Rudolph Oct 05 '12 at 07:02
  • @Konrad: Why should `data()[size()]` *not* be a reference? – Xeo Oct 05 '12 at 07:04
  • @Xeo Oh, it is of course. But it’s a reference to an object in the same underlying buffer. – Konrad Rudolph Oct 05 '12 at 07:07
  • 6
    @MichaelAnderson, yes, definitely. It writes to memory. – Jonathan Wakely Oct 05 '12 at 07:43
  • @Konrad: Not necessarily when accessed as `operator[](size())`, it seems. :) – Xeo Oct 05 '12 at 08:04
  • @Xeo But it’s not accessing `operator[]`. What you get back from `data()` is a raw pointer to `CharT`, it cannot call `string::operator[]`. That’s what I meant. Not that it really helps here. – Konrad Rudolph Oct 05 '12 at 08:26
  • 2
    @KonradRudolph Well, `data` and `c_str` return const pointers, so they are out of the question for modifying, anyway. And from 21.4.5 p2 it doesn't really follow that `*(&str[0] + str.size())` is even allowed, since `[]` is only equal to `*(begin()+pos)` for `pos < size()`. I think an implementation is perfectly allowed to hold the string data in a `length` array together with an additional `static const charT` member for the null (of course this means it would have to maintain an additional buffer to return by `data` and `c_str`, but why not?). – Christian Rau Oct 05 '12 at 08:44
  • @ChristianRau That’s what I was alluding to, yes. Although in fact the spec guarantees that constructing a string of length `n` will generate a buffer of length `n + 1` with `data()` pointing to its beginning. So the buffer does exist, as you were implying. But the spec says nothing about whether `str[str.size()]` will still point into that buffer. – Konrad Rudolph Oct 05 '12 at 08:46
  • 2
    @KonradRudolph: It has to. The buffer is *contiguous*. Therefore, `&str[str.size() - 1] == &str[str.size()] - 1` must be true (assuming `length()` is at least 1). If it weren't, then the buffer wouldn't be contiguous. – Nicol Bolas Oct 05 '12 at 08:58
  • 3
    @NicolBolas No. `size()` is an invalid argument for `operator[]` so nothing guarantees that its return value will point to the buffer. For instance (far-fetched), `operator[]` could contain the following logic: `static CharT terminator{}; if (index == size()) return terminator; else return _data[i];`. – Konrad Rudolph Oct 05 '12 at 09:02
  • 1
    @Konrad how does that implementation maintain the requirements of `data`? ("*Returns*: A pointer `p` such that `p + i == &operator[](i)` for each `i` in `[0,size()]`") – R. Martinho Fernandes Oct 05 '12 at 09:14
  • @R.MartinhoFernandes Ah, it’s comparing addresses. That changes everything. In that case, I’m tempted to just ignore the spec here … – Konrad Rudolph Oct 05 '12 at 09:15
  • @ChristianRau: In C++11, an implementation is *required* to return the same underlying buffer for `data()` and `c_str()` too, which means your thought "(*of course this means it would have to maintain an additional buffer to return by data and c_str, **but why not?***)" goes astray. – Nawaz Jan 12 '13 at 18:45
  • Uh, **what returns this**? I presume it is `operator[]`, but you forgot to actually say that. Also note that the C function won't overwrite the location `&s[s.size()]`, it will overwrite `&s[0] + size()`. – Ben Voigt Oct 20 '14 at 14:55
12

LWG 2475 made this valid by editing the specification of operator[](size()) (inserted text in bold):

Otherwise, returns a reference to an object of type charT with value charT(), where modifying the object **to any value other than charT()** leads to undefined behavior.
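As a rough illustration of what that wording allows, here is a minimal sketch. The function name rewrite_terminator is hypothetical, and the sketch covers only a write through operator[], not a write through a pointer obtained from &s[0] or data(), which the comments below discuss separately.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: overwrites the terminator with charT() through
// operator[](size()). Under the amended wording this is permitted,
// because the value written is charT() itself.
std::string rewrite_terminator(std::string s) {
    s[s.size()] = '\0';   // writing any other value would be undefined
    return s;
}
```

The string's size and contents should be unchanged afterwards, since the write stores the value that was already there.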

T.C.
  • I don't see how this resolves the situation: It means you can write `s[s.size()] = '\0';` but it is still not defined (in C++14) that `&s[s.size() - 1] + 1` is dereferencable (let alone yielding a null terminator); so by extension an algorithm that starts at `&s[0]` and increments a `char *` still can't read or write the null terminator. The section [string.require]/4 only defines this pointer arithmetic for the index being strictly less than the size. – M.M Jul 19 '17 at 01:49
  • 2
    @M.M. That's hidden in the specification of `data()`. – T.C. Jul 19 '17 at 01:54
  • 1
    I agree that if `data()` is used as the source of the pointer then it's all good, however I don't see where `&s[0]` gives the same guarantee. Specifically, I don't see anything preventing the implementation not writing the null terminator until `data()` or `c_str()` is actually called (and either writing it or using a dummy for the `&s[s.size()]` case). – M.M Jul 19 '17 at 02:00
  • @M.M Then file an LWG issue. The intent here is pretty clear, so if the wording doesn't match it's in the "obvious defect" category. – T.C. Jul 19 '17 at 02:08
  • I don't know why LWG 2475 is being referred to in past tense, the [status report](https://cplusplus.github.io/LWG/lwg-status.html) shows that it is not yet resolved. – Ben Voigt Jul 19 '17 at 02:18
  • @T.C.: Just grabbed the latest draft, the wording change leaves things... weird? inconsistent? Going to study the proposal in detail. – Ben Voigt Jul 19 '17 at 02:25
  • And there in the notes on the proposal: "Tues PM: This should also apply to non-const `data()`. Billy to update wording." Looks like that hasn't happened, leaving overwriting the terminator still illegal, if done by pointer arithmetic instead of calling the string's `operator[]()`. – Ben Voigt Jul 19 '17 at 02:27
  • @BenVoigt: It should be noted that the current status (updated on this day) of 2475 is "C++17". So the committee seems to believe that it has been resolved and added to C++17. – Nicol Bolas Aug 07 '17 at 01:49
11

According to the spec, overwriting the terminating NUL should be undefined behavior. So, the right thing to do would be to allocate length+1 characters in the string, pass the string buffer to the C API, and then resize() back to length:

// "+ 1" to make room for the terminating NUL for the C API
std::string str(length + 1, '\0');

// Call the C API passing &str[0] to safely write to the string buffer
...

// Resize back to length
str.resize(length);

(FWIW, I tried the "overwriting NUL" approach on MSVC10, and it works fine.)
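For completeness, the same approach as a self-contained sketch. The function fill_buffer is a hypothetical stand-in for the C API (it is not a real library call), assumed to write length characters followed by a terminating NUL; read_from_c_api is likewise a made-up wrapper name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for the C API: writes `length` characters
// followed by a terminating NUL, so `length + 1` bytes in total.
void fill_buffer(char* out, std::size_t length) {
    std::memset(out, 'x', length);
    out[length] = '\0';
}

std::string read_from_c_api(std::size_t length) {
    // One extra element, so the API's NUL lands at index `length`,
    // which is inside [0, size()) and therefore an ordinary element.
    std::string str(length + 1, '\0');

    // &str[0] is valid because the string is never empty here.
    fill_buffer(&str[0], length);

    // Drop the API's NUL; the string still maintains its own terminator.
    str.resize(length);
    return str;
}
```

With this stand-in, read_from_c_api(3) yields a string of three 'x' characters. The cost is the one extra character of storage mentioned in the comments below.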

Mr.C64
  • 2
    I’d go with this solution as well. But it’s unsatisfying that this requires a totally needless allocation of an extra character. Why didn’t the spec just make the null termination writeable? – Konrad Rudolph Oct 05 '12 at 09:12
  • Because that would mean more typing? 'You must not overwrite the null, except with another null'. Perhaps their fingers were getting tired, or it was the end of the day and the bars were open. – Martin James Oct 05 '12 at 09:21
  • @KonradRudolph: I agree that the standard should be changed, making it possible to overwrite a `NUL` with another `NUL`. I see no reason why it shouldn't be possible or should trigger undefined behavior, and I don't like the needless allocation of an extra character either. – Mr.C64 Oct 05 '12 at 11:28
5

I suppose n3092 isn't current any more but that's what I have. Section 21.4.5 allows access to a single element. It requires pos <= size(). If pos < size() then you get the actual element, otherwise (i.e. if pos == size()) then you get a non-modifiable reference.

I think that as far as the programming language is concerned, a kind of access which could modify the value is considered a modification even if the new value is the same as the old value.

Does g++ have a pedantic library that you can link to?

Windows programmer
  • libstdc++ has a [debug mode](http://gcc.gnu.org/onlinedocs/libstdc++/manual/debug_mode.html) but there are limits to what it will diagnose. It validates iterator operations, but can't notice writes of individual bytes through pointers. – Jonathan Wakely Oct 05 '12 at 07:45