
The usage here is the same as in "Using read() directly into a C++ std::vector", but with reallocation taken into account.

The size of the input file is unknown, so the buffer is reallocated by doubling its size whenever the file size exceeds the buffer size. Here's my code:

#include <vector>
#include <fstream>
#include <iostream>

int main()
{
    const size_t initSize = 1;
    std::vector<char> buf(initSize); // sizes buf to initSize, so &buf[0] below is valid
    std::ifstream ifile("D:\\Pictures\\input.jpg", std::ios_base::in|std::ios_base::binary);
    if (ifile)
    {
        size_t bufLen = 0;
        for (buf.reserve(1024); !ifile.eof(); buf.reserve(buf.capacity() << 1))
        {
            std::cout << buf.capacity() << std::endl;
            ifile.read(&buf[0] + bufLen, buf.capacity() - bufLen);
            bufLen += ifile.gcount();
        }
        std::ofstream ofile("rebuild.jpg", std::ios_base::out|std::ios_base::binary);
        if (ofile)
        {
            ofile.write(&buf[0], bufLen);
        }
    }
}

The program prints the vector capacity as expected and writes an output file of the same size as the input, BUT only the bytes before offset initSize match the input; everything after that is zero...

Using &buf[bufLen] in read() is definitely undefined behavior, but &buf[0] + bufLen gets the right position to write to because contiguous allocation is guaranteed, isn't it? (Provided initSize != 0. Note that std::vector<char> buf(initSize); sizes buf to initSize. And yes, if initSize == 0, a runtime fatal error occurs in my environment.) Am I missing something? Is this also UB? Does the standard say anything about this usage of std::vector?

Yes, I know we can calculate the file size first and allocate exactly that buffer size, but in my project the input files can be expected to nearly ALWAYS be smaller than a certain SIZE, so I can set initSize to SIZE and expect no overhead (like file size calculation), and use reallocation just for "exception handling". And yes, I know I can replace reserve() with resize() and capacity() with size(), and then get things to work with little overhead (zeroing the buffer on every resize), but I still want to get rid of any redundant operation, just a kind of paranoia...

updated 1:

In fact, we can logically deduce from the standard that &buf[0] + bufLen gets the right position. Consider:

std::vector<char> buf(128);
buf.reserve(512);
char* bufPtr0 = &buf[0], *bufPtrOutofRange = &buf[0] + 200;
buf.resize(256); std::cout << "standard guarantees no reallocation" << std::endl;
char* bufPtr1 = &buf[0], *bufInRange = &buf[200]; 
if (bufPtr0 == bufPtr1)
    std::cout << "so bufPtr0 == bufPtr1" << std::endl;
std::cout << "and 200 < buf.size(), standard guarantees bufInRange == bufPtr1 + 200" << std::endl;
if (bufInRange == bufPtrOutofRange)
    std::cout << "finally we have: bufInRange == bufPtrOutofRange" << std::endl;

output:

standard guarantees no reallocation
so bufPtr0 == bufPtr1
and 200 < buf.size(), standard guarantees bufInRange == bufPtr1 + 200
finally we have: bufInRange == bufPtrOutofRange

And here 200 can be replaced with any i such that buf.size() <= i < buf.capacity(), and a similar deduction holds.

updated 2:

Yes, I did miss something... But the problem is not contiguity (see update 1), and not even a failure to write memory (see my answer). Today I got some time to look into the problem: the program got the right address and wrote the right data into the reserved memory, but in the next reserve(), buf is reallocated with ONLY the elements in the range [0, buf.size()) copied to the new memory. So this is the answer to the whole riddle...

Final note: If you don't need reallocation after your buffer is filled with some data, you can definitely use reserve()/capacity() instead of resize()/size(), but if you do, use the latter. Also, under all implementations available here (VC++, g++, ICC), the example works as expected:
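What reserve() does promise can be checked without touching anything past size(). The following is only a minimal sketch; the function name is made up for illustration, and it exercises only behavior the standard specifies (size() unchanged, capacity() at least the requested value, existing elements preserved):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical demo (not part of the original program): returns true if
// reserve() left size() and the existing elements untouched while growing
// capacity(). It never reads or writes beyond size().
bool reserveKeepsSizedElements()
{
    std::vector<char> buf(4, 'x');        // size() == 4, all elements 'x'
    buf.reserve(1024);                    // may reallocate; elements [0, 4) are carried over
    bool ok = buf.size() == 4 && buf.capacity() >= 1024;
    for (std::size_t i = 0; i < buf.size(); ++i)
        ok = ok && buf[i] == 'x';         // only elements inside size() are checked
    return ok;
}
```

Nothing in this check says anything about bytes between size() and capacity(), which is exactly the region the program above was writing to.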

const size_t initSize = 1;
std::vector<char> buf(initSize);
buf.reserve(1024*100); // assume the reserved space is enough for file reading
std::ifstream ifile("D:\\Pictures\\input.jpg", std::ios_base::in|std::ios_base::binary);
if (ifile)
{
    ifile.read(&buf[0], buf.capacity());  // ok. the whole file is read into buf
    std::ofstream ofile("rebuld.jpg", std::ios_base::out|std::ios_base::binary);
    if (ofile)
    {
        ofile.write(&buf[0], ifile.gcount()); // rebuld.jpg just identical to input.jpg
    }
}
buf.reserve(1024*200); // horror! probably always lose all data in buf after offset initSize

And here's another example, quoted from 'TC++PL, 4e', p. 1041. Note that the first line in the function uses reserve() rather than resize():

void fill(istream& in, string& s, int max)
// use s as target for low-level input (simplified)
{
    s.reserve(max); // make sure there is enough allocated space
    in.read(&s[0],max);
    const int n = in.gcount(); // number of characters read
    s.resize(n);
    s.shrink_to_fit();  // discard excess capacity
}

Update 3 (after 8 years): Many things happened during these years. I did not use C++ as my working language for nearly 6 years, and now I am a PhD student! Also, though many think there are UBs, the reasons they gave are quite different (and some were already shown not to be UB), indicating this is a complex case. So, before casting votes and writing answers, it is highly recommended to read and get involved in the comments.

Another thing is that, with the PhD training, I can now dive into the C++ standard with relative ease, which I dared not years ago. I believe I showed in my own answer that, based on the standard, the above two code blocks should work. (The string example requires C++11.) Since my answer is still contentious (but not falsified, I believe), I do not accept it, but rather am open to critical reviews and other answers.

wpzdm
  • I think this is UB too, writing past the end of a vector is always UB. But why not use resize and size instead of reserve and capacity? Then it would be OK. I don't see what you think you are gaining this way. – john Sep 28 '13 at 07:09
  • 1
    If I understand your question, you're reserving, but that just means when/if elements are *legally* added (such as a push_back, insert, or resize) an allocation up to the capacity is already in place to prevent a realloc and copy/move. And you're wondering if this is "ok" anyway? Is that right? If so, I'm going with "no". – WhozCraig Sep 28 '13 at 07:12
  • 3
    ...continued. In fact even this: `&buf[0]` is UB. Nowhere in this code is the vector ever "sized". Only capacity is reserved. A reasonable debug version of vector that checks for OOB conditions with `operator[]` (the release version won't) will scream assertions. The question you linked opened the accepted answer with "Use resize() instead of reserve()" and it holds true for this as well. If you want the last frame to be filled perfectly, maintain a read count and perform a final `resize()` *down* to the actual total number of items read. – WhozCraig Sep 28 '13 at 07:26
  • @john "writing past the end of a vector is always UB", maybe, but I can't find a place that definitely says this. And I'm using resize and size in my project now. I posted this question mainly to find out whether this is UB. thx. :) – wpzdm Sep 28 '13 at 07:30
  • @WhozCraig No, `std::vector buf(initSize);` sizes `buf` to `initSize`, and I've debugged the program without any runtime errors or assertions. – wpzdm Sep 28 '13 at 07:34
  • 1
    I don't get this. Why not use `ifile.seekg(0, std::ios_base::end);`, `std::streamsize size = ifile.tellg();` (for the size), and return to the start with: `ifile.seekg(0, std::ios_base::beg);` - given a simple 'disk' file, I don't see how your method could possibly be more effective. At the very least, use Fibonacci sequence steps in size rather than just doubling. – Brett Hale Sep 28 '13 at 07:49
  • @BrettHale yes, this is what I meant by "we can calculate the file size first and allocate exactly the same buffer size". In my project, the files to be read can be expected to be smaller than a certain `SIZE`, so I can set `initSize` to `SIZE` and expect no further overhead (like file size calculation), and use reallocation just for "exception handling". For the same reason, the Fibonacci sequence is not worth bothering with. thx :) – wpzdm Sep 28 '13 at 07:59
  • 2
    @wpzdm I see. So you're saying that, regardless of not `resize()`ing to a valid size, because you didn't trip a fault or assertion it should be ok? It isn't. The contiguous memory you're relying on is covered in the standard by C++11 23.3.6.1, specifically "...&v[n] == &v[0] + n for all 0 <= n < v.size()." If you read into memory for some `n` *not* less than the *reported* `size()`, it isn't defined as sequential by the standard. – WhozCraig Sep 28 '13 at 08:49
  • @WhozCraig thx! I've found nearly the same words in 'The C++ Standard Library: A Tutorial and Reference, 2e' 7.3.3 : _Thus, you can expect that for any valid index i in vector v, the following yields true: &v[i] == &v[0] + i_. When I read through this, I thought, "Yes, `&v[n]` is itself undefined if `n >= v.size()`, so `&v[n] == &v[0] + n` must have `0 <= n < v.size()`, but how about `&v[0] + n` alone? Doesn't it point to the right position in reserved memory?" – wpzdm Sep 28 '13 at 09:30
  • 1
    I'm still utterly lost at what you're attempting to *save* by *not* resizing ? The expressed reason I cited that specific portion of the standard was its directed at the aforementioned *continuity*. The full sentence reads: "The elements of a vector are stored contiguously, meaning that if v is a vector where T is some type other than bool, then it obeys the identity &v[n] == &v[0] + n for all 0 <= n < v.size()." From that I read you *cannot* rely on continuity unless `n < size()`. It isn't in your case, and therefore it is UB. – WhozCraig Sep 28 '13 at 10:07
  • @WhozCraig Because when `n > v.size()`, `v.resize(n)` needs to default construct `v[i]` for all `v.size() <= i < n` (and destruct when `n < v.size()`). I accept that in my above case, of course, this overhead can be ignored. And see my updates in the question. – wpzdm Sep 28 '13 at 11:52
  • 1
    So this is all to avoid value-initialization (which fires default-construction for user-defined cv-types, or zero-initializes for pod-types)? That is how `std::vector<>` works. Using it in an undefined way isn't the way to circumvent that. I understand the *"why"*; but the *"how"* isn't right if you step into UB to do it. If this is truly some *measurable* bottleneck (and I can't see how it is, as the file-io should be your sore-point) implementing a sub-featured sequence container that does only what you want would be the approach I would take. But thats just me. – WhozCraig Sep 28 '13 at 17:23
  • But we still cannot definitely say that what I did above is UB, can we? I agree (and have agreed) that the zero-initialization overhead can be ignored in my case. When I wrote that piece of code, my consideration was that _if_ `reserve()/capacity()` works, we _should_ prefer them to `resize()/size()`. And I wrote, compiled and tested it, and it turned out not to work, so I wondered why. – wpzdm Sep 28 '13 at 23:47
  • updated again and solved the problem – wpzdm Oct 02 '13 at 03:38
  • Your conclusion is incorrect. Any attempt to use the memory beyond `size()` is risking undefined behavior. Just because you can make it work today, doesn't mean it will work tomorrow. The next C++ standard could add some requirement that needs scratch space, and the implementation of `vector` could use that reserved memory for it - you never know. – Mark Ransom Oct 03 '13 at 01:53
  • @MarkRansom yeah, I should have been more circumspect. updated again. – wpzdm Oct 03 '13 at 12:21
  • 1
    The example you copied from TC++PL doesn't work, see http://ideone.com/2AnmcW – Mark Ransom May 16 '14 at 20:29
  • according to this page (http://en.cppreference.com/w/cpp/container/vector/data), if you can't call the method `data()`, you can get the buffer pointer with `&front()` – user666412 Sep 11 '14 at 21:32
  • @MarkRansom Long time no see. I have left C++ for many years. Regarding your last comment, remove ‘resize’ and use ‘p’ in the output, then it works (https://ideone.com/WsyYoh). It seems ‘resize’ in string does not ensure no reallocation. So, the code in the book indeed does not work, but for a reason not quite relevant to the question here. (It might be worth reporting the problem to the author.) – wpzdm Sep 11 '21 at 02:23
  • @wpzdm congratulations on leaving C++ behind, I'll bet your life is much less stressful. Getting the book's example to work just emphasizes my point - using memory that doesn't belong to you, even when you *know* it's allocated, is going to lead to trouble. Code that works one day may mysteriously stop working the next, and that applies to your "fix" too. The book was wrong to include that example. – Mark Ransom Sep 11 '21 at 03:41
  • @MarkRansom I agree with your general point, but for this specific case, now I believe the code is well-defined. Please see my new answer. – wpzdm Sep 11 '21 at 08:19
  • @WhozCraig My own answer newly added would be interesting to you. – wpzdm Sep 11 '21 at 11:07

2 Answers


reserve doesn't actually add the space to the vector, it only makes sure that you won't need a reallocation when you resize it. Instead of using reserve you should use resize, then do a final resize once you know how many bytes you actually read in.

All that reserve is guaranteed to do is prevent the invalidation of iterators and pointers as you increase the size of the vector up to capacity(). It is not guaranteed to maintain the contents of those reserved bytes unless they're part of the size().

For example, it's common for code built with a Debug flag to include extra features to make it easier to find bugs. Maybe newly allocated memory will be filled with a well defined pattern. And maybe the class will periodically scan that memory to see if it's changed, and throw an exception if it has under the assumption that only a bug could have caused that change. Such an implementation would still be standard conforming.

The example of std::string is even better, because there's a case that's almost guaranteed to fail. string::c_str() will return a pointer to the string with a null terminator character at the end. Now a conforming implementation could allocate a second buffer with room for the terminating null and return that pointer after copying the string, but that would be very wasteful. Much more likely is that the string class will just make sure its reserved buffer has room for the extra null character and write a null there as necessary. But the standard doesn't dictate when that null will be written, it could be in the call to c_str or it could be at any point where the string might be modified. So you have no way of knowing when one of your bytes is going to be overwritten.

If you really want a buffer of uninitialized bytes, std::vector<char> is probably the wrong tool anyway. You should look at a smart pointer such as std::unique_ptr<char[]> instead.

Mark Ransom
  • No, `reserve()` does actually add the space to the vector. See my final note and the example. – wpzdm Oct 02 '13 at 05:37
  • 1
    @wpzdm, it reserves the space but doesn't officially add it to the controlled sequence of bytes. The difference is important in this application. – Mark Ransom Oct 02 '13 at 05:46
  • What do you mean by 'controlled'? If you mean 'contiguously added after `v[v.size()-1]`', then it _is_ (see my update 1). If you mean 'have read/write access to', frankly speaking, I'm not sure, but all the compilers I have at hand say it also is. – wpzdm Oct 02 '13 at 05:59
  • 1
    @wpzdm, I'm talking about the guarantees made by the standard, not implementation details. The bytes may be reserved in memory, but nothing is done with them yet - no objects have been constructed there for example. Any access outside of the `size()` is probably undefined behavior and anything can happen. – Mark Ransom Oct 02 '13 at 15:26
  • 1
    @MarkRansom There is no "probably" about it; it _is_ UB. – underscore_d Sep 10 '16 at 20:15
  • @underscore_d I now believe it is well-defined, see my new answer. – wpzdm Sep 11 '21 at 08:20
  • Mark, I appreciate your contribution to this question, and hope the answer could be updated to reflect the recent state of the discussions. – wpzdm Sep 16 '21 at 01:09
  • @wpzdm I'm sorry to disappoint you, but none of the ongoing discussion has convinced me that anything from this answer needs updating. I stand by it as originally written. – Mark Ransom Sep 16 '21 at 04:32
  • I am not saying there is anything explicitly falsified in this answer, but it could be clarified. For example, without looking into the discussions (which are not so easy to follow), it would be confusing what is meant by “reserve doesn't actually add the space to the vector” and “ ...maintain the contents of those reserved bytes unless they're part of the size().” – wpzdm Sep 16 '21 at 13:12
  • Also, I think “ it only makes sure....” is not accurate, because I showed that it also makes sure we can read/write the memory (though whether the content would be unexpectedly rewritten is contentious). – wpzdm Sep 16 '21 at 13:18

The bold texts in the answer are my main claims. I have given due effort and care by quoting/referring to the standard, but I am open to the possibility that my reading/understanding would have gaps/errors.

I read the C++03 standard because it is shorter and easier, and I believe the related parts are in essence the same in the newest standard. In short, there are no UBs in the last two code blocks of the question, because the reserve()ed memory consists of well-behaved objects, and the effects of vector operations on those objects are defined by the standard.

It was shown in Update 1 of the question that reserve() allocates contiguous memory and that, as long as no reallocation occurs, we can get the right addresses into it. (I can provide the respective standard texts if needed.) The more dubious part is whether the allocated memory can be accessed as in the question (basically, whether we can safely read/write the memory). Let us go into this.

First, the memory is not in some "scratch space". reserve() uses the vector's allocator to allocate memory. And the allocator uses operator new (standard 20.4.1.1), which in turn calls an allocation function (18.4.1.1). Thus the storage duration lasts until a deallocation (e.g., delete) is called on the memory (3.7.3). There might be a concern about lifetime, but this is in fact no problem for us (see below).

Second, is it really, as Mark said, that "nothing is done with them yet - no objects have been constructed there"? First of all, what is an object? (1.8) "An object is a region of storage," that "has a storage duration (3.7) which influences its lifetime (3.8)" and also a type (3.9). Importantly for us, "an object is created by [...] a new-expression". Thus, instead of "nothing is done", we should say that an object (here of type char) is created using the allocator! (Of course, the object is not initialized, but this is no problem for us.) Also important for us: because char is POD, the lifetime of the allocated object starts as soon as the storage is obtained (3.8 1). For any POD object, we can memcpy from it and back into it, and the value stored there remains the same, even if the value is invalid for the type (e.g., uninitialized garbage)! (3.9 2). Thus, we have the right to read/write the memory (as char objects). Moreover, we can use other defined operations of the type (say "="), because the object is within its lifetime.

In general, we can use POD vectors like buffers as suggested in the last part of the question. Particularly, accessing reserve()ed memory of POD vectors out of size() is well-defined. Precisely, we can access the memory pointed by &vec[m] + n, where m < size() and m+n < capacity() (but &vec[m+n] is UB!).

Keeping in mind that we still have the old size(), we can even reason about the defined behavior of vector methods. For example, the memory beyond size() will not be copied after a reallocation triggered by reserve(). Because reserve() only allocates (or reallocates) uninitialized memory, the container only needs to copy the objects within size() into the reallocated memory, and the memory outside size() should remain uninitialized.

PS: The last example is from TC++PL 4ed, and should work only for C++11 and above. In C++11 and above the memory of a string is contiguous, but this is not guaranteed in earlier versions (Does "&s[0]" point to contiguous characters in a std::string?).

Edit: Mark made a good point in the comments: even if we can access the reserve()ed memory, could it be overwritten by the vector outside our control? I believe not. Every operation (method, algorithm) on a container has a standard-defined effect, given by a dedicated "Effects" paragraph or by the overall requirements (23.1). So, if an operation has an effect on reserve()ed memory, the standard should specify it.

For example, the effect of erase(q1, q2) is "erases the elements in the range [q1, q2)" (23.1.1) and "Invalidates iterators and references at or after the point of the erase" (23.2.4.4). Thus, erase() has no effect on reserve()ed memory.

On the other hand, we know insert() has an effect on reserve()ed memory, but this can be reasoned, and in this sense, we are in control. There is nowhere in the standard that says any container operation has the effect that "could periodically wipe out anything beyond [size()]", so it should not do it!

E_net4
wpzdm
  • Does anyone know why the question and answers were made community wiki? Could I reverse the process? – wpzdm Sep 11 '21 at 08:36
  • 3
    https://stackoverflow.com/help/privileges/community-wiki – HolyBlackCat Sep 11 '21 at 08:53
  • @HolyBlackCat Hmm, I believe the question was made wiki without informing me (and the answers automatically). Could I request to remove the wiki mode? – wpzdm Sep 11 '21 at 09:02
  • I have sent a message through the flag. – wpzdm Sep 11 '21 at 09:10
  • The timeline says [you made the Q a community wiki](https://stackoverflow.com/posts/19064318/timeline) way back in 2013, and [made the A community wiki](https://stackoverflow.com/posts/69141237/timeline) right when posting it (unsure if there was an option to post a regular answer). – HolyBlackCat Sep 11 '21 at 09:15
  • 2
    @HolyBlackCat The Q was automatically converted to wiki because of too many edits (by myself), and then the A will also be wiki (see https://meta.stackexchange.com/questions/2974/is-there-a-way-to-remove-community-wiki-status/83373#83373) – wpzdm Sep 11 '21 at 09:19
  • 3
    You're still not getting my main point. Even if the memory is allocated, *you don't own it* - the `vector` or `string` object does. Unless the standard explicitly says otherwise, any access you make that succeeds is purely luck. The standard says you can directly access elements of the container up to `size()` so that part's well defined, but the container could periodically wipe out anything beyond that and still be standard compliant. – Mark Ransom Sep 11 '21 at 16:44
  • @MarkRansom This is an interesting point, and I updated the answer accordingly. – wpzdm Sep 13 '21 at 04:11
  • 2
    If you must write a long post like this to say '_maybe_ it _might_ work', you're doing it wrong. Implication of wording, and certainly established consensus/convention is that elements beyond `size() - 1` can't be accessed; that's it. If you think you found wiggle room in the Standard, file a defect, as it'd mean wording contradicts what experienced implementors expect: you won't access those elements, so they needn't protect you from shooting your foot off. That's UB: don't do it; if you do, you're on your own. Also C++03 is ancient & probably has bugs, so it's not a good reference in 2021... – underscore_d Sep 13 '21 at 10:28
  • Again, you're missing the point. The standard doesn't list possible side effects on reserved memory because it's irrelevant; you're not supposed to access that reserved memory in the first place! – Mark Ransom Sep 13 '21 at 14:33
  • @underscore_d I did not have to write it so long, but did so to show my effort, make it more convincing, and invite critical reviews. But it seems to me you did not read it (I apologize if I am wrong). I would appreciate references to the “established consensus/convention” and “ what experienced implementors expect”. – wpzdm Sep 13 '21 at 14:49
  • @MarkRansom “ The standard doesn't list possible side effects on reserved memory because it's irrelevant.” I am not convinced without references. – wpzdm Sep 13 '21 at 14:51
  • 1
    @wpzdm: The Standard doesn't define what happens when you access reserved memory outside the vector size. That is the very definition of "undefined behavior". It doesn't list possible side effects because it already told you (in the explanation of undefined behavior) that any side effects are possible. – Ben Voigt Sep 14 '21 at 15:27
  • 2
    Don't confuse "`erase` has no effect on the guarantees concerning reserved memory" (true, there were no guarantees before and none after) with "`erase` is guaranteed to have no effect concerning reserved memory" (false). – Ben Voigt Sep 14 '21 at 15:45
  • @BenVoigt UB means nowhere in the whole Standard the behavior is defined (1.3.13). However, although the `vector` section "doesn't define what happens when you access reserved memory", according to the Standard (as referred in the answer), the expression `&v[m]+n` is well-defined and the object pointed by it is well-behaved. – wpzdm Sep 15 '21 at 05:28
  • @BenVoigt "`erase` has no effect on the guarantees concerning reserved memory." This sentence seems problematic by definition, because "effects" are "changes in the state of the execution environment" (1.9 7) but "guarantee" is not a "state". Maybe, you mean, again, "any side effects are possible" because of UBs and no guarantees, but I am not convinced there are UBs. – wpzdm Sep 15 '21 at 05:29
  • 1
    @wpzdm is the expression `&v[m]+n` well-defined for all values of `n`, or only those up to `size()`? The standard is not required to list all operations that are undefined, that would be impossible because that list is infinite. Rather it calls out what's defined, and you need to treat anything outside that scope as undefined. – Mark Ransom Sep 15 '21 at 14:38
  • @wpzdm: When dealing with the abstract model of the standard-compliant execution environment, state is exactly the set of guarantees currently in effect. The model of physical state of a real platform (i.e. what is in each CPU register and RAM address) falls far short of standard compliance, because it doesn't allow analysis of strict aliasing, etc. – Ben Voigt Sep 15 '21 at 15:09
  • @MarkRansom As mentioned in the answer, for those m and n “ `m < size()` and `m+n < capacity() `”. Because then `&v[m]` is defined and returns a pointer `p`, and `p+n` is of course defined. I agree with the rest of the comment, and it does not conflict with my answer. – wpzdm Sep 16 '21 at 12:45
  • @BenVoigt Understood. So your argument is based on that there are UBs and no guarantees, for which I am not convinced. – wpzdm Sep 16 '21 at 12:50
  • @MarkRansom But let me stress again that “ UB means nowhere in the whole Standard the behavior is defined (1.3.13)”. We cannot simply look at the `vector` section and say “it is not defined here, so it is a UB”. – wpzdm Sep 16 '21 at 12:53
  • reserving memory is not necessarily the same as allocating memory. for example, in win32 the VirtualAlloc() flags MEM_COMMIT and MEM_RESERVE. – Spongman Jan 16 '23 at 18:24