17

I'm curious if this sort of thing is legal:

std::vector<some_class_type> vec;
vec.reserve(10);
some_class_type* ptr = vec.data() + 3; // that object doesn't exist yet

Note that I'm not attempting to access the value pointed to.

This is what the standard says about data(), but I'm not sure if it's relevant:

Returns: A pointer such that [data(),data() + size()) is a valid range. For a non-empty vector, data() == &front().

Cornstalks
Pubby
  • 4
    I'd say the initialization of the pointer per se is legal. Dereferencing it is UB. – πάντα ῥεῖ Dec 20 '14 at 15:48
  • From what I can see, what you're doing with `reserve` works like `malloc` where the memory is allocated but is not initialized. To be honest, I'm not entirely familiar with the Standard, but logically if this isn't legal then `malloc` isn't legal either. – Nard Dec 20 '14 at 15:53
  • 2
    I guess it all depends on whether the array memory _must_ be allocated at the time `reserve` is called or if it can be deferred until first access. If there is no explicit requirement in the standard I suppose some implementations could defer the allocation, which might lead to UB. Not sure the call to `data` would be considered "first access" or enough to invoke the allocation if that is the case. – Captain Obvlious Dec 20 '14 at 15:56
  • @Nard: for the sake of completeness, responding here also: the difference between `reserve` and `malloc` is that future resize operations (triggered, for instance, by the growth of a vector past its capacity) will cause the data to be reallocated, possibly making the region in memory pointed to by `ptr` invalid. – EyasSH Dec 20 '14 at 16:24
  • @EyasSH In other words, `reserve` does not necessarily call `malloc` or `realloc` and is simply defined by the standard as a function that **informs a vector of a planned change in size** so what it actually does is implementation-dependent, am I right? So the fallacy that lies with thinking of `reserve` as `malloc` is that memory might not be allocated at the point of time `reserve` is called, therefore what OP is doing is definitely undefined behaviour, I suppose? – Nard Dec 20 '14 at 17:35
  • @EyasSH: That comparison doesn't make sense. Whether memory is returned by `malloc` or `reserve`d, growing it (with `realloc` or another `reserve`) will likely (but not certainly) invalidate the old ptr. No difference. – MSalters Dec 20 '14 at 18:45
  • @Nard reallocation happens when reserve is called and not at the first access ([vector.capacity]). The standard doesn't seem to define what reallocation is exactly though. – n. m. could be an AI Jul 05 '21 at 05:49

5 Answers

1
  1. data must return a pointer to a valid range ([vector.data]). Pointers that point into that range, including the one that points one past the last element of the range, must be stable until next reallocation. When size() is zero, data() points past the end of an empty range, which is a perfectly valid pointer to keep (but not to dereference of course).
  2. Since data must remain stable for the next 10 insertions, we can assume that a storage region suitably sized and aligned for placing an array of 10 elements in it was obtained, and data points to the beginning of that region. That is, the implementation must call an allocation function and set the internal data pointer to its return value. (There is no direct indication in the standard that this must be true, but I cannot imagine a situation where it might be false. I assume that reallocation mandated ([vector.capacity]) by the standard actually calls one of the allocation functions such as ::operator new or std::malloc, otherwise calling this operation "reallocation" would be rather dubious. At any rate, there seems to be no way to avoid such allocation on any current architecture).
  3. Any such storage region actually contains a live array of an appropriate size ([intro.object]) (though not necessarily live array elements, depending on the element type). Pointer arithmetic within such an array is valid, even though dereferencing resulting pointers is not necessarily valid.
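The stability argument in points 1 and 2 can be observed on a given implementation with a minimal sketch like the one below. The function name `data_stable_within_capacity` is made up for this illustration, and the check only shows what one particular standard library does; it does not settle what the standard guarantees.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: reports whether data() keeps the same value while
// n elements are appended to a vector that reserved room for n up front.
bool data_stable_within_capacity(std::size_t n) {
    std::vector<int> vec;
    vec.reserve(n);
    assert(vec.capacity() >= n);       // postcondition of reserve()
    const int* before = vec.data();    // the vector is still empty here
    for (std::size_t i = 0; i < n; ++i) {
        vec.push_back(static_cast<int>(i));  // size() never exceeds capacity()
    }
    return vec.data() == before;       // compare addresses only, no dereference
}
```

On the mainstream standard libraries, `data_stable_within_capacity(10)` is expected to return true, which matches the reasoning above.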
n. m. could be an AI
  • I think that your assumption in 2 is right in most of the cases. I have however not found anything in the standard that guarantees that a reallocation or even an allocation takes place for an empty vector before any element was ever inserted into the vector (see my quotes, which referred to C++11 or C++14 and might since have been renumbered). – Christophe Jul 05 '21 at 12:18
  • And there is also this other answer to another question which seems to share a similar doubt: https://stackoverflow.com/a/38972840/3723423 – Christophe Jul 05 '21 at 12:33
  • @Christophe You are right, that empty vectors usually don't hold storage. But the reserve() call forces the vector to allocate memory immediately, even though still empty. – Kai Petzke Aug 22 '21 at 14:04
0

In most implementations of the STL, a reserve on an empty vector will trigger a reallocation and ensure that the data you are pointing to is owned/managed.

The location of the data (the value of the pointer as returned by data()) might change when a vector is resized. Holding a pointer per se is of course legal, dereferencing it for a read while uninitialized is of course undefined, and dereferencing it after it is initialized is only legal if you can guarantee that your vector did not resize and as such the range you allocated is still in the same place.

Incrementing a pointer to data that has been malloc'd is fine. In this example, you perform pointer arithmetic to hold a pointer to data that you know has been allocated by the std::vector. Regardless of whether the element pointed to by the pointer is ever initialized, a resize operation is problematic as it might deallocate the memory you are pointing to.
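As a rough sketch of the resize problem, the function below (the name `value_after_growth` is invented for this example) grows a vector past its capacity and then re-derives the pointer from `data()` instead of reusing the old one:

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: grows a vector beyond its capacity and reads an
// element through a pointer that is re-derived after the growth.
int value_after_growth() {
    std::vector<int> vec{1, 2, 3};
    const int* p = vec.data() + 2;     // points at the last element (value 3)
    const int before = *p;
    vec.resize(vec.capacity() + 1);    // new size exceeds the old capacity: reallocation
    // From here on, the old value of p must not be used at all.
    p = vec.data() + 2;                // re-derive the pointer from the new storage
    assert(*p == before);              // existing elements are carried over
    return *p;
}
```

The call returns 3. The point is only that the pointer has to be fetched again after any operation that may reallocate.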

EyasSH
  • I don't see how the pointer itself can change. The data pointed to by the pointer, however, might change during a `resize` operation. – Nard Dec 20 '14 at 15:56
  • 1
    @Nard The pointer can become invalid at any time when capacity changes, because the underlying implementation might use `realloc` which can return a different address if the current memory block cannot be extended in-place. – SirDarius Dec 20 '14 at 15:59
  • 1
    I'm pretty sure the pointer returned by `data()` can change. `std::vector` reserves the right to move the block of data around. This is how a vector grows its underlying array while keeping it contiguous. – EyasSH Dec 20 '14 at 15:59
  • 1
    @Nard The pointer used for `data()` well can change due to a `resize()` operation. – πάντα ῥεῖ Dec 20 '14 at 15:59
  • This information about *when to dereference the pointer* may be valid, but the question explicitly states that **there is no dereferencing**. If you could expand on the "of course [its] legal" part, that may be a helpful answer. – Drew Dormann Dec 20 '14 at 15:59
  • The pointer returned by `data()` can change when the memory is reallocated, but inserting elements won't cause reallocations until the needed size exceeds the capacity. –  Dec 20 '14 at 16:00
  • @SirDarius, EyasSH, πάντα ῥεῖ What I'm saying is that the **data pointed to by the pointer** can change but the **location of the pointer itself** shouldn't change from a `resize` operation. – Nard Dec 20 '14 at 16:02
  • 1
    @Nard - the **location of the data** can change from a resize. The pointer returned by `data()` is a pointer to the location of the data. As such, the value returned by `data()` can change. – EyasSH Dec 20 '14 at 16:03
  • @EyasSH In your answer, you said that "the location of the pointer returned by `data()` might change". :/ – Nard Dec 20 '14 at 16:06
  • Dereferencing a pointer to uninitialized memory is in fact OK: `int *ptr = new int; *ptr = 42`. Reading is UB. – MSalters Dec 20 '14 at 18:56
0

Summary

We cannot assume that this pointer is valid. Moreover, the pointer arithmetic vec.data() + 3 might be UB. And since nothing guarantees that it is not UB, if this code works it's implementation dependent.

Note: this answer was reworded to make a better distinction between UB and at risk of being UB.

Language-lawyer reasoning

Isn't this vector empty?

In your code snippet, you use reserve() and data() on an empty vector vec that has a size() of 0. We know two things from your own quote of the standard:

  • first, data() returns a pointer such that [data(), data() + size()) is a valid range. In your case, we can therefore assume that [data(), data() + 0) is a valid range. But there is no such guarantee for [data(), data() + capacity()). Your standard library could provide such an implementation-dependent guarantee, but you cannot be certain in general. If not, the expression would definitely be UB (explanation further down).
  • second, for a non-empty vector, data() returns the address of the first element. This is an assumption you make (otherwise you wouldn't add a fixed index). But since your vector is empty, you cannot be sure. An implementation is for example perfectly allowed to return nullptr for an empty vector, regardless of its capacity.

Why is the valid range so important?

A fundamental rule is often ignored: a pointer arithmetic operation is UB if it isn't within the range of a valid array object. The standard expresses this more formally:

[expr.add]/4: When an expression J that has integral type is added to or subtracted from an expression P of pointer type, the result has the type of P.

  • If P evaluates to a null pointer value and J evaluates to 0, the result is a null pointer value.
  • Otherwise, if P points to an array element i of an array object x with n elements, the expressions P + J and J + P (where J has the value j) point to the (possibly-hypothetical) array element i + j of x if 0 ≤ i + j ≤ n and the expression P - J points to the (possibly-hypothetical) array element i − j of x if 0 ≤ i − j ≤ n.
  • Otherwise, the behavior is undefined.

This means that if you add an integer to a pointer, and the result would be out of the valid range, the expression itself would be UB, before a pointer result is even computed. This also means that adding anything other than 0 to nullptr would be UB as well.
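To make the rule concrete, here is a small sketch on an ordinary array, where the bounds are not in dispute. The function name `one_past_end_distance` is made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: forms the one-past-the-end pointer of a 4-element
// array and returns its distance from the first element.
std::ptrdiff_t one_past_end_distance() {
    int arr[4] = {0, 1, 2, 3};
    int* first = arr;          // element 0
    int* last = arr + 4;       // one past the end: fine to form, not to dereference
    // int* bad = arr + 5;     // undefined behavior: i + j = 5 exceeds n = 4
    int* null = nullptr;
    assert(null + 0 == nullptr);   // adding 0 to a null pointer is allowed
    return last - first;
}
```

The function returns 4. Merely evaluating the commented-out line would already be undefined, even though the result is never dereferenced.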

So if the vector implementation of your library strictly complies with the standard, without any additional guarantee, your code would be UB because of this pointer arithmetic rule (out of range). But since we do not know what your implementation does, we cannot be sure of UB. The only thing we are sure of is that UB cannot be excluded and the code is not portable.

Additional thoughts

It may be tempting to believe that reserve() guarantees that memory is allocated for the vector, and hence that the range [data(), data() + capacity()) is valid. But this is not at all the case: the pointer arithmetic rule is not about allocated memory but about an array element of an array object with n elements.

An implementation could well allocate memory and create the array object of the exact right size using placement new, to preserve the addresses of the existing elements. It would not be a super efficient implementation, but it'd be a legal one.

Moreover, the standard gives for reserve() and capacity() guarantees about the absence of reallocation:

[vector.capacity]/4: A directive that informs a vector of a planned change in size, so that it can manage the storage allocation accordingly. After reserve(), capacity() is greater or equal to the argument of reserve if reallocation happens; and equal to the previous value of capacity() otherwise. Reallocation happens at this point if and only if the current capacity is less than the argument of reserve().

[vector.capacity]/1: Returns: The total number of elements that the vector can hold without requiring reallocation.

But as long as the vector stays empty, no element might ever be reallocated. So a standard-compliant implementation does not have to worry about any reallocation and could delay the first allocation until just before the first element is inserted in the vector. I would personally not implement it like that, but it would be legal and cannot be excluded. The fact that data() is not obliged to return a pointer to a first element when the vector is empty seems tailor-made to allow this kind of implementation.
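What the quoted wording does promise about capacity() can be checked directly. The following is only a sketch (the name `reserve_postconditions_hold` is invented here), and it says nothing about when the memory is actually obtained:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: checks the capacity() postconditions of reserve(n)
// on a freshly constructed, empty vector.
bool reserve_postconditions_hold(std::size_t n) {
    std::vector<int> vec;
    const std::size_t old_cap = vec.capacity();
    vec.reserve(n);
    assert(vec.size() == 0);             // reserve() never changes size()
    if (old_cap < n) {
        return vec.capacity() >= n;      // reallocation case
    }
    return vec.capacity() == old_cap;    // no reallocation: capacity unchanged
}
```

A conforming library should return true for any n up to max_size(), whether or not it allocates eagerly.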

Final word

Your code will work with mainstream implementations, since it is quite common for practical reasons for reserve() to trigger allocation/reallocation. But if you want portable code that also works perfectly on exotic microcontroller architectures in mission-critical systems with lives at risk, then you'd better avoid such shortcuts.
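One way to stay on the safe side, sketched here under the assumption that the position is known in advance, is to remember the index and form the pointer only once the element exists. The function name `read_fourth_element` is made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: keeps an index instead of an early pointer and forms
// the pointer only after the element has been created.
int read_fourth_element() {
    std::vector<int> vec;
    vec.reserve(10);
    const std::size_t idx = 3;       // remember the position, not a pointer
    vec.resize(4);                   // elements 0..3 now exist (value-initialized)
    assert(idx < vec.size());
    vec[idx] = 42;
    int* ptr = vec.data() + idx;     // inside [data(), data() + size())
    return *ptr;
}
```

This returns 42 and relies only on the [data(), data() + size()) guarantee from the question's quote.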


Christophe
  • I'm not doing this to an empty vector. Also, that wording guarantees that the memory exists after calling reserve. – Pubby Dec 20 '14 at 16:14
  • 1
    @Pubby: In the code you posted you're doing it to an empty vector. – Cornstalks Dec 20 '14 at 16:14
  • "allocate the needed capacity only when adding the first element" This seems inconsistent with the wording in the quote that says: *Reallocation happens at this point[...]* – Cornstalks Dec 20 '14 at 16:15
  • Sorry I meant empty as in `capacity() == 0` – Pubby Dec 20 '14 at 16:17
  • @Pubby Your example is definitively an empty vector. If you have any doubt about it, just verify its `size()`. The size corresponds to the number of elements that were effectively constructed in the vector. It shouldn't be confused with its capacity(). – Christophe Dec 20 '14 at 16:18
  • The wording is murky. It guarantees that the memory will exist after calling reserve() with a greater capacity. Implementations can create initial vectors with >0 capacities, and are not obliged to allocate these. Imagine, for instance, 1- create empty vector (capacity set to 8, data unallocated), 2- call reserve(5) -- *if and only if* condition is not met, 3- insert first element. Implementations are free to allocate the memory only at point "3" in this case. – EyasSH Dec 20 '14 at 16:20
  • The wording does seem to guarantee, however, that a `reserve()` larger than the initial capacity **must** trigger an allocation, even for an empty vector. – EyasSH Dec 20 '14 at 16:21
  • Can the person who downvoted my answer please demonstrate what is wrong with my answer ? – Christophe Dec 20 '14 at 16:22
  • 3
    I didn't downvote it, but I'm assuming it's because the "if and only if" language seems to imply that reserve will guarantee that you'll have allocated memory of at least that size after calling it (because if you didn't, then that'd be because your capacity is larger), and they are assuming that a capacity > 0 guarantees that a corresponding `data` region exists. – EyasSH Dec 20 '14 at 16:28
  • @EyasSH But nothing ensures, as far as I know, that the initial capacity when a vector is created is 0. So if the implementation decides to allocate capacity in groups of 10 items, you would not trigger a reallocation. – Christophe Dec 20 '14 at 16:28
  • You are absolutely right. Which is why I expanded on what you said in the comments to clarify exactly that situation. You are also right that the STL standard **does not** define the initial capacity of a vector. According to (http://stackoverflow.com/a/23415582/864313) "all well-known implementations use 0 as the default capacity", which is why some might consider your answer irrelevant. – EyasSH Dec 20 '14 at 16:31
  • 1
    Remarks: Reallocation invalidates all the references, pointers, and iterators referring to the elements in the sequence. **No reallocation shall take place during insertions that happen after a call to reserve() until the time when an insertion would make the size of the vector greater than the value of capacity()** -- If I call `reserve` then lazy allocation isn't possible... – Pubby Dec 20 '14 at 16:54
  • @EyasSH I fully agree that in most implementations it will work. But the question is about standard compliance. In a mission-critical environment, an invalid assumption in this regard may cause the mission to fail. – Christophe Dec 20 '14 at 16:55
  • @Pubby in the case I describe, at the first insertion you don't have a "reallocation" but an "allocation". According to your standard quotation, my hypothetical but compliant lazy block allocation scheme wouldn't invalidate anything, because when you took the pointer it was only valid for the range of size 0. – Christophe Dec 20 '14 at 17:01
  • I don't see how this makes sense. Sure, it's possible that there's an initial capacity >= 10. That would just make `data()+3` valid regardless of the `reserve`. The real question is: is it legal to access `data()+I` if `I < capacity()` ? – MSalters Dec 20 '14 at 18:50
  • Christophe is suggesting a scheme where, for an empty (size() == 0) vector with capacity > 0, a vector implementation chooses to lazily allocate memory. – EyasSH Dec 20 '14 at 19:03
  • @MSalters The hidden question of OP is: can he rightfully use `*ptr` if he later does `vec.resize(4)`, assuming that his reservation already allocated the space. Although it would work in most implementations, my only point here is to underline that the C++ standard does not guarantee this to work. – Christophe Dec 20 '14 at 19:19
  • @Christophe: Why? The standard specifically bans vector from doing a reallocation unless necessary. See Pubby's quote above. – MSalters Dec 21 '14 at 03:10
  • 1
    The standard uses the term "reallocation" for both "initial allocation" and "(true) reallocation". This is clear from the following sentence: "Reallocation happens at this point if and only if the current capacity is less than the argument of reserve()." If capacity was zero, no allocation has happened, and the reallocation becomes the initial allocation. – Kai Petzke Aug 22 '21 at 14:01
  • @KaiPetzke this is an interesting thought; do you have any objective evidence (e.g. another standard quote, or technical report from the standard committee) that would support it? But let’s suppose it would be so: this is secondary to the question: the specification of data() allows it to return nullptr or any other irrelevant pointer for an empty vector (it is not obliged to return the address of the first element in this case, even if the allocation was done) – Christophe Aug 22 '21 at 14:21
  • @Christophe See below, I have put all my thoughts together in an independent answer now. Short version: The standard forbids pointer invalidations except for reallocations. As `data()` is a pointer, implementations may only change it at reallocations. So for an empty vector with capacity zero, they can return `nullptr`, but for a vector with non-zero capacity, they have to return the address where the first element will go, as they must not change the `data()` pointer without reallocation. – Kai Petzke Aug 22 '21 at 15:06
-1

In the remarks to vector::reserve(), the C++17 standard states: "No reallocation shall take place during insertions that happen after a call to reserve() until the time when an insertion would make the size of the vector greater than the value of capacity()."

In the remarks to vector::shrink_to_fit(), the same standard states: "Reallocation invalidates all the references, pointers, and iterators referring to the elements in the sequence as well as the past-the-end iterator. If no reallocation happens, they remain valid."

Combining these two leads to this statement: After a call to reserve(), no pointers must be invalidated by any insertions, as long as the initially reserved capacity is respected. As the value returned by data() is clearly a pointer, the rule applies to it. So, if an application calls reserve() with a positive number, the implementation must immediately set its data() pointer. It may only change it when a reallocation may take place.
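The uncontroversial part of this rule, for a pointer to an element that already exists, can be sketched as follows. The function name `first_element_after_fill` is invented for this illustration.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: takes a pointer to an existing element, fills the
// vector up to its capacity, and reads through the same pointer afterwards.
int first_element_after_fill() {
    std::vector<int> vec;
    vec.reserve(10);
    vec.push_back(7);
    const int* first = &vec.front();     // pointer to a constructed element
    while (vec.size() < vec.capacity()) {
        vec.push_back(0);                // size() never exceeds capacity(): no reallocation
    }
    assert(first == vec.data());         // same storage as before the insertions
    return *first;
}
```

The call returns 7. Whether the same guarantee extends to a data() pointer taken while the vector was still empty is the step this answer argues for.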

Some people might say that the standard talks about "pointers to individual elements" that don't get invalidated, but data() doesn't point to any element if the vector has reserved space but is still empty. But well, the standard says: "pointers ... referring to the elements in the sequence". And data() is clearly referring to all the elements in the sequence. As mathematically an empty set is still a set and an empty sequence is still a sequence, a data() pointer pointing to the empty sequence must nonetheless not be invalidated when that sequence is later enlarged (unless, of course, its capacity() is exhausted).

But what about pointer arithmetic into empty space, e.g. vec.data() + 3? Well, in C, this is clearly a valid operation, as we know that vec.data() points to space for 10 elements, so advancing three elements into it is fine. In C++, this C-like pointer arithmetic is still legal, as long as we never dare to dereference those pointers before they become valid.

Kai Petzke
  • There is apparently a wrong assumption about data(). The fact that it returns a pointer that should not be invalidated is only true for non-empty vectors. The spec of data() is formulated in a way that there is no obligation to return a pointer to the first element if the vector is empty, so the absence of reallocation does not come into play here. – Christophe Aug 22 '21 at 15:23
  • And here a case of someone who experienced the problem: https://stackoverflow.com/q/38972443/3723423 – Christophe Aug 22 '21 at 15:28
  • Thanks for the link to the person with the problem. Unfortunately, he does not mention compiler and library used, nor his code. So we do not know, if his code crashes because of this particular problem or something else! Especially, that guy says: "Since iterating data is then crashing ..." He doesn't even say, that he inserted enough data to iterate over it! – Kai Petzke Aug 22 '21 at 15:44
  • Please point me to where in the `vector` description of the standard, it says: "The fact that it returns a pointer that should not be invalidated is only true for non empty vectors." Rather, the standard is very clear: "If no reallocation happens, they [all iterators, pointers, references] remain valid." The equality data() == addressof(front()) is restricted in the standard to non-empty vectors only, because `front()` is an illegal operation on an empty container. – Kai Petzke Aug 22 '21 at 15:58
  • Section [vector.data] (aka 22.3.11.4 for C++20) says "returns a pointer such that [data(), data() + size()) is a valid range. For a non-empty vector data() == addressof(front())". This last sentence is key here: for an empty vector, there is no guarantee that data points to the first element at all. Any other claim is pure interpretation. Fortunately, it may work as you think on mainstream implementations, but you have no certainty that this is portable on exotic implementations. – Christophe Aug 22 '21 at 19:36
-2

Of course, it's legal. The quote you mentioned is irrelevant, since size is not equal to "reserved space" which reserve provides. You can also initialize vec.data()+3 before vec[0], although the vector's "size" variable will not be updated.

So, while this use of vector is extremely undesirable, vector is not much more than a thin wrapper for a dynamically allocated array, and an abuse of vector this way is not illegal.

As a rule of thumb: Once you use the vector::data() function, you are doing something genuinely wrong.

cigien
CasualCay
  • 1
    I’d strongly recommend to look at this: https://stackoverflow.com/q/10473573/3723423 (**pointer arithmetic** is UB if not within the bounds of a properly allocated array, even if you do not use the pointer at all). OP’s question quote is therefore fully relevant here, since data() is only guaranteeing that the underlying dynamically managed array has an upper bound of size() and OP does arithmetic beyond that size. – Christophe Jul 05 '21 at 07:56
  • Please read OP's and my post again. OP's quote is irrelevant because of his reserve(10) statement right before the data()+3 statement. – CasualCay Jul 05 '21 at 18:00
  • Of course, in theory, you can always write an allocator that reserves nothing for any n (which wouldn't be standard compliant). Then, in this degenerate case, this operation is illegal. But that's not OP's or my point here. – CasualCay Jul 05 '21 at 18:03
  • Thank you for your feedback. I fear that the question is not about whether it works in general (and if it’s a “degenerate” case), but about compliance with the standard. And either it is 100% compliant, or it is risky non-portable code. Every week there are reports about cyberattacks and zero-day exploits. Suppose this code works in 98% of cases, but what if it’s UB, and the 2% happens on an essential microcontroller embedded in a plane autopilot or a medical ventilation system, where it might put lives at risk? – Christophe Jul 05 '21 at 20:48
  • Well, that's what I'm talking about: It is compliant and it is legal. And I already wrote that I wouldn't recommend it, although I wasn't as dramatic as you are now. :) – CasualCay Jul 05 '21 at 21:12
  • To be frank, I have no clue, why the data() function is still out there other than backwards compatibility to 40 year-old code. – CasualCay Jul 05 '21 at 21:26