
Suppose that one has a function foo_with_mem_allocs that operates by recursion over a vector of ints, such that it needs to allocate/delete a buffer array at every recursion step (similar to what a merge_sort algorithm would do):

void foo_with_mem_allocs(vector<int>& v, int c)
{
    if (c > 1)
    {
        c /= 2;
        foo_with_mem_allocs(v, c);

        int *buffer = new int[v.size()];
        for (int i = 0; i < v.size(); i++)
        {
            buffer[i] = v[i];
        }

        for (int j = 0; j < c; j++)
            v[j] = buffer[j] * c;

        delete[] buffer;
    }
}

Now suppose that one wants to avoid the cost of allocating/deleting the array buffer by passing a pre-allocated buffer into the function. My immediate idea was to implement a copy of that same function, with a minor modification:

void foo_with_buffer(vector<int>& v, int c, vector<int>& buffer)
{

    if (c > 1)
    {
        c /= 2;
        foo_with_buffer(v, c, buffer);

        for (int i = 0; i < v.size(); i++)
        {
            buffer[i] = v[i];
        }

        for (int j = 0; j < c; j++)
            v[j] = buffer[j] * c;
    }
}

I would expect foo_with_buffer to be significantly faster than foo_with_mem_allocs. However, when I run test cases and profile them, I find that both take roughly the same amount of time for any given size of v and any given value of c. Quite often, the version with the buffer even runs slower:

[Image: sample profiling output from a run under Unix, showing near-identical times for the two functions]

I would really love to understand why the version without all the memory allocations and deallocations is not faster, as I think it should be. If it helps, I tested this both by compiling with Visual Studio 2015 (in Release mode, 64-bit) under Windows 10 64-bit and by compiling with G++ under Unix using g++ -o test test.cpp (where test.cpp is the file containing the whole code). The picture above with an example of results is from a run under Unix.

Below you can find the entire code, including the profiling test, in a directly reproducible format:

//////////////////////////////////////////////////////////////////////////////

#include <iostream>
#include <ctime>
#include <cstdlib>
#include <vector>
#include <algorithm>
using namespace std;

void foo_with_mem_allocs(vector<int>& v, int c) {

    if (c > 1)
    {
        c /= 2;
        foo_with_mem_allocs(v, c);

        int *buffer = new int[v.size()];
        for (int i = 0; i < v.size(); i++)
        {
            buffer[i] = v[i];
        }

        for (int j = 0; j < c; j++)
            v[j] = buffer[j] * c;

        delete[] buffer;
    }
}
void foo_with_buffer(vector<int>& v, int c, vector<int>& buffer) {

    if (c > 1)
    {
        c /= 2;
        foo_with_buffer(v, c, buffer);

        for (int i = 0; i < v.size(); i++)
        {
            buffer[i] = v[i];
        }

        for (int j = 0; j < c; j++)
            v[j] = buffer[j] * c;
    }
}

int main(int argc, char** argv) {

    // seed the random number generator with the current time
    srand((unsigned)time(NULL));
    // print out header of output information
    cout << "Size\tMethod\t\t\tTicks\tSecs" << endl;
    // fixed size of the input data
    int sz = 100000;
    vector<int> v(sz);
    vector<int> v1(sz);
    vector<int> v2(sz);
    vector<int> buffer(sz);

    for (int i = 0; i < sz; ++i)
        v[i] = v1[i] = v2[i] = rand();

    // timing foo that does a bunch of memory allocations/deallocations
    clock_t t = clock();
    foo_with_mem_allocs(v1, sz);
    t = clock() - t;
    cout << sz << "\tWith mem alloc\t\t" << t << "\t"
        << (double)t / CLOCKS_PER_SEC << endl;
    // timing foo that uses buffers to avoid memory allocations/deallocations
    t = clock();
    foo_with_buffer(v2, sz, buffer);
    t = clock() - t;
    cout << sz << "\tWith buffer\t\t" << t <<
        "\t" << (double)t / CLOCKS_PER_SEC << endl;

    bool changed = false;
    for (int i = 0; i < v1.size(); i++)
        if (v1[i] != v2[i])
            changed = true;
    if (changed)
        std::cout << "Note: results from the two functions were different.\n" << std::endl;
    else
        std::cout << "Note: results from the two functions were identical.\n" << std::endl;
    return 0;
}
///////////////////////////////////////////////////////////////////////////////
RAs
  • *I tested that both compiling with Visual Studio 2015 under Windows 10 64bits and also compiling with G++ under Unix.* -- You didn't post the optimization settings you used when you compiled the code. If you're timing unoptimized (i.e. debug mode) code, your results are meaningless. This is especially the case for the Visual Studio version, where iterator checks are done in debug mode. – PaulMcKenzie Oct 11 '17 at 19:10
  • @PaulMcKenzie Thanks for your comment. No, I am not compiling in Debug mode. I am compiling in Release 64-bit mode. In Unix, I simply compiled with `g++ -o test test.cpp`. I edited the question to add this info. – RAs Oct 11 '17 at 19:20
  • You need to specify one of the optimization flags, e.g. `-O2` or `-O3`. – PaulMcKenzie Oct 11 '17 at 19:21
  • @PaulMcKenzie Yep, it does work as expected if I do `g++ -o test test.cpp -O2` or the same with `-O3`. Could you give me any insight into why that is needed in this particular case? I mean, why is avoiding memory allocation/deallocation not cheaper from the get-go, without the **need** for compiler optimization? More than solving the situation, I am actually trying to understand it. Thanks! – RAs Oct 11 '17 at 19:30
  • @user - Without optimizations you tell the compiler "Please don't make this run fast". Then it is pretty uninteresting just *how* un-fast it is running. It's like having Usain Bolt walk slowly 100m. – Bo Persson Oct 11 '17 at 19:43
  • @user2019840: In addition to what I said in my answer, your recursion depth is logarithmic; with `sz` as 10 million, it recurses a couple of dozen times. Since the cost of memory allocation is not related to the size of the allocation (unless the allocated memory is being cleared, which in this case it is not), 24 allocations will almost always have trivial impact relative to the ten-million iteration loop. – rici Oct 11 '17 at 19:56
  • @user2019840 Without optimization, `vector` would often be significantly slower than a plain old array just to do things like use `operator[]`, since those calls might not be inlined. In a debug build it can be considerably slower still, since it can do things like bounds checking in those functions for safety. Also, on top of what `rici` pointed out about heap allocation being a relatively negligible overhead in this case, you're using a size of `sz = 100000`.... –  Jan 04 '18 at 11:31
  • Where you generally get the biggest bang from using the stack or memory allocated in advance is when you're doing very light work in each function call on very teeny arrays, like, say, 64 elements or less. A heap allocation is going to be relatively trivial in most cases where your input sizes are in the range of 100k. Heap allocations become expensive if you are allocating, say, teeny dynamic arrays containing 4-10 elements each a billion times over while doing very light processing, as the sketch below this comment thread illustrates. –  Jan 04 '18 at 11:31
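
To make that last point concrete, here is a minimal sketch contrasting the two regimes (the repetition count and the 8-element array size are illustrative choices, not taken from the question, and the actual ratio will depend on your allocator, compiler, and flags):

#include <ctime>
#include <iostream>

int main() {
    const int reps = 10000000;

    // Regime 1: a fresh heap allocation for every tiny piece of work.
    std::clock_t t = std::clock();
    long long sum1 = 0;
    for (int r = 0; r < reps; r++) {
        int* tiny = new int[8];
        for (int i = 0; i < 8; i++) tiny[i] = i;
        for (int i = 0; i < 8; i++) sum1 += tiny[i];
        delete[] tiny;
    }
    std::cout << "fresh allocation each call: " << std::clock() - t << " ticks\n";

    // Regime 2: one buffer reused across all the calls.
    t = std::clock();
    long long sum2 = 0;
    int reused[8];
    for (int r = 0; r < reps; r++) {
        for (int i = 0; i < 8; i++) reused[i] = i;
        for (int i = 0; i < 8; i++) sum2 += reused[i];
    }
    std::cout << "reused buffer:              " << std::clock() - t << " ticks\n";

    // Use the sums so the compiler cannot discard the loops entirely.
    return (sum1 == sum2) ? 0 : 1;
}

Here an allocation happens once per tiny unit of work rather than a couple of dozen times in total, which is the regime where reusing a buffer typically pays off.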

2 Answers


You are assuming that dynamic allocation must be expensive. You need to re-examine that prejudgement.

Microbenchmarks like this won't reveal the cost of memory allocation because the use case is pretty well optimal. You are always allocating a single buffer of the same size and immediately freeing it before another allocation request is made. Any decent memory allocation library will just give you the same memory every time, taking it from the cache of recently released allocations of the desired size. The code path is very short.
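
You can watch that recycling happen. The following is only an illustration — the behaviour is allocator-specific and in no way guaranteed — but on typical implementations the same address comes back on every iteration:

#include <iostream>

int main() {
    // Allocate and immediately free a block of the same size, as the
    // recursive function in the question does; most allocators hand the
    // just-freed block straight back, so the printed addresses usually
    // all match.
    for (int i = 0; i < 3; i++) {
        int* buffer = new int[100000];
        std::cout << static_cast<void*>(buffer) << '\n';
        delete[] buffer;
    }
    return 0;
}

When the allocator behaves this way, each of the couple of dozen allocations in the question costs little more than a few pointer operations.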

But even in realistic benchmarks, you will discover the result of generations of very talented programmers having worked very hard to optimise memory allocation for your benefit, so that you can use dynamic allocation without having to waste your time with premature optimisations.


As a small postscript, microbenchmarking without enabling optimization can produce other incongruities. For example, unoptimized element access in a std::vector<int> might turn out to be significantly slower than element access in a simple array of ints. This difference might be sufficient to completely overshadow the slight cost of a few dynamic allocations.
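
A rough way to see that effect for yourself is the sketch below (the sizes are arbitrary choices; this is not a rigorous benchmark, results vary by compiler and standard library, and the gap largely disappears once you compile with -O2):

#include <ctime>
#include <iostream>
#include <vector>

int main() {
    const int n = 100000, reps = 1000;
    std::vector<int> v(n, 1);
    int* a = new int[n];
    for (int i = 0; i < n; i++) a[i] = 1;

    // Element access through std::vector::operator[]: without
    // optimization this is often an out-of-line function call.
    std::clock_t t = std::clock();
    long long sv = 0;
    for (int r = 0; r < reps; r++)
        for (int i = 0; i < n; i++) sv += v[i];
    std::cout << "vector: " << std::clock() - t << " ticks\n";

    // Element access through a raw array: a direct load either way.
    t = std::clock();
    long long sa = 0;
    for (int r = 0; r < reps; r++)
        for (int i = 0; i < n; i++) sa += a[i];
    std::cout << "array:  " << std::clock() - t << " ticks\n";

    delete[] a;
    return (sv == sa) ? 0 : 1;
}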

rici
  • Many thanks. Rather than trying to do any premature optimization, I am actually trying to understand how these things work. Your answer was quite insightful. So, regarding your postscript, I then tried to run the code without optimizations, but making the `buffer` a dynamic array instead of a vector. The same thing happens: both functions run in roughly the same time, and sometimes the one with the buffer is even slower. So I guess the origin of the need for optimizations in this particular case might be somewhere else - for instance in what you'd suggested in the comments to my question – RAs Oct 11 '17 at 20:06
  • @user2019840: really, the only way to get a good handle on the differences is to examine the code produced by the compiler(s). And even then, you might not see anything obvious, because tiny differences in code generation can affect loop timings. (Even the cache-line alignment of the loop code itself.) The suggestion about `std::vector` is based on what I've seen in certain compilers (the access is compiled to a function call rather than being inlined unless optimization is enabled) but it is hardly universal. (Probably nothing is :) ) – rici Oct 11 '17 at 20:15

You're saying "I tried this and I got a confusing time result".

That means you don't actually know what accounts for the time. If you don't know what accounts for the time, what you did is guess. Even a good guess is still a guess, and guesses, by their nature, can deceive.

Instead of guessing, try random pausing: run the program under a debugger, interrupt it at a few random moments, and look at the call stack each time. Whatever shows up on most of the stacks is what accounts for the time. This technique costs nothing, is super easy, and tells you exactly what takes time. No guesswork.
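
A minimal sketch of what such a session might look like with gdb (assuming a Unix-like setup and a run long enough to interrupt a few times; the annotations after # are commentary, not gdb input):

$ g++ -g -o test test.cpp
$ gdb ./test
(gdb) run
^C                 # interrupt at a random moment while the program runs
(gdb) bt           # the backtrace shows exactly what was executing
(gdb) continue     # resume, then repeat a handful of times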

Mike Dunlavey
  • Pretty odd line of reasoning. 1) If you mean that because I don't know the reason for the time difference I made a guess, it makes no sense: of course I don't know what accounts for the time, that is what I asked - I made no guesses about that. 2) If you are saying that I measured time wrongly, that is different from saying I guessed - it just means I did it wrong. In any case, I really appreciate you pointing to that post where you explain sampling profiling. I have read it before and even applied it in the past. Thanks for reminding me of it, even if it doesn't answer my question per se – RAs Oct 12 '17 at 02:18
  • @user2019840: The guess is that memory allocation is the problem. – Mike Dunlavey Oct 12 '17 at 19:13
  • But you see, I wasn't blaming memory allocations for the code being slow; I was actually trying to understand why their absence didn't make things faster. These are not necessarily the same question. Anyhow, assuming that memory allocations/deallocations are, *ceteris paribus*, more expensive than not doing them is not a guess; it is a trivial matter of fact. The issue, as others have pointed out, might be that optimizations done by the compiler can easily compensate for that and leave the implementation with mem allocations/deallocations running just as fast, or even faster – RAs Oct 13 '17 at 02:20