slow performance for 3D array delete C++

Question

int newHeight = _height/2;
    int newWidth = _width/2;

    double*** imageData = new double**[newHeight];
    for (int i = 0; i < newHeight; i++)
    {
        imageData[i] = new double*[newWidth];
        for (int j = 0; j < newWidth; j++)
        {
            imageData[i][j] = new double[4];
        }
    }

I have dynamically allocated this 3D matrix. what is the fastest and safest way to free the memory here?

here is that I have done but this takes a few seconds my matrix is big (1500,2000,4)

  for (int i = 0; i != _height/2; i++)
        {
            for (int j = 0; j != _width/2; j++)
            {
                delete[] imageData[i][j];
            }
            delete[] imageData[i];
        }
        delete[] imageData;

Update
As suggested I have chosen this solution:

std::vector<std::vector<std::array<double,4>>>

the performance is great for my case

@πάνταῥεῖ please read the question the measures are (1500,2000,4) — Gilad, Apr 11 '15 at 17:01
Maybe you can use `std::arrays` and they'll be automatically freed when they go out of scope. You can check the doc [here](http://en.cppreference.com/w/cpp/container/array) — Jcao02, Apr 11 '15 at 17:06
@laser_wizard I have used std::vector it will take 3 times time — Gilad, Apr 11 '15 at 17:10
@Jaco02 I think that every template from STL will slow performance badly here. — Gilad, Apr 11 '15 at 17:15
@Gilad I made some test on ideone, and actually `std::array` is way faster. You should try it. — Jcao02, Apr 11 '15 at 17:34
@Gilad if you really need dynamic sizing in the two superior dimensions, there is no reason to dynamically allocate each final dimension static row of `double [4]`. You can eliminate the cost of allocation for that entire dimension (the most expensive part of your algorithm) by conjoining it with the prior. That is *exactly* what you'll get with `std::vector>>`. The RAII semantics are a welcome bonus. Did you test *that* ? — WhozCraig, Apr 11 '15 at 17:47
@WhozCraig and Jcao02 I have used std::vector>> you are right this is crazy fast, std::vector>> was unbelievably slow is allocation and deallocation and double*** is fast allocation however very slow deallocation. please write this is as an answer I will accept it. — Gilad, Apr 11 '15 at 20:19
Best performance will be to use a single flat vector and some index calculation to get from 3d to 1d index. — Neil Kirk, Apr 11 '15 at 20:55

score 5 · Answer 1 · answered Apr 11 '15 at 17:10

5

Allocate the entire image data as one block so you can free it as one block, ie. double* imageData = new double[width*height*4]; delete [] imageData; and index into it using offsets. Right now you are making 3 million separate allocations which is thrashing your heap.

answered Apr 11 '15 at 17:10

Qartar

400
1
9

this will create a great block of memory in my RAM this is a memory killer. – Gilad Apr 11 '15 at 17:12
1

It will consume less RAM and be less fragmented than the individual allocation version. – Puppy Apr 11 '15 at 17:14
3

You are allocating _more_ memory your way since each allocation requires its own header for the heap manager. – Qartar Apr 11 '15 at 17:15
qartar I'm allocating more space but i can address my matrix without all the pointer mathematics i need to do on this solution. Anyway the best solution for me now is std::vector>> is it crazy fast. thanks anyway I gave you +1 – Gilad Apr 11 '15 at 20:22
1

@Gilad: Allocation as a single block is not exclusive with multiple dimension subscripting. See http://stackoverflow.com/a/29375830/103167 – Ben Voigt Apr 11 '15 at 20:32
@Gilad: Normally, you make a wrapper class that does the pointer arithmetic, so you only have to do that once, and everywhere else you just call the function. – Apr 11 '15 at 21:40

score 2 · Answer 2 · edited May 23 '17 at 12:10

I agree with qartar's answer right up until he said "index into it using offsets". That isn't necessary. You can have your single allocation and multiple subscript access (imageData[i][j][k]) too. I previously showed this method here, it's not difficult to adapt it for the 3-D case:

allocation code as follows:

double*** imageData;
imageData = new double**[width];
imageData[0] = new double*[width * height];
imageData[0][0] = new double[width * height * 4];
for (int i = 0; i < width; i++) {
    if (i > 0) {
        imageData[i] = imageData[i-1] + height;
        imageData[i][0] = imageData[i-1][0] + height * 4;
    }
    for (int j = 1; j < height; j++) {
        imageData[i][j] = imageData[i][j-1] + 4;
    }
}

Deallocation becomes simpler:

delete[] imageData[0][0];
delete[] imageData[0];
delete[] imageData;

Of course, you can and should use std::vector to do the deallocation automatically:

std::vector<double**> imageData(width);
std::vector<double*> imageDataRows(width * height);
std::vector<double> imageDataCells(width * height * 4);
for (int i = 0; i < width; i++) {
    imageData[i] = &imageDataRows[i * height];
    for (int j = 0; j < height; j++) {
        imageData[i][j] = &imageDataCells[(i * height + j) * 4];
    }
}

and deallocation is completely automatic.

See my other answer for more explanation.

Or use std::array<double,4> for the last subscript, and use 2-D dynamic allocation via this method.

@Gilad: Probably so, but it still has a lot of separately allocated blocks, so contiguity is not great and allocator overhead is higher. I'm showing a way to do it with only three allocations. The linked question shows another benefit, that you can also treat the data as one large block for e.g. saving to disk in one call. — Ben Voigt, Apr 12 '15 at 17:45

score 2 · Answer 3 · edited May 23 '17 at 10:24

A slight variation on the first idea of Ben Voigt's answer:

double ***imagedata = new double**[height];
double **p = new double*[height * width];
double *q = new double[height * width * length];
for (int i = 0; i < height; ++i, p += width) {
    imagedata[i] = p;
    for (int j = 0; j < width; ++j, q += length) {
        imagedata[i][j] = q;
    }
}
// ...
delete[] imagedata[0][0];
delete[] imagedata[0];
delete[] imagedata;

It is possible to do the whole thing with a single allocation, but that would introduce a bit of complexity that you might not want to pay.

Now, the fact that each table lookup involves a couple of back-to-back reads of pointers from memory, this solution will pretty much always be quite inferior to allocating a flat array, and doing index calculations to convert a triple of indices into one flat index (and you should write a wrapper class that does these index calculations for you).

The main reason to use arrays of pointers to arrays of pointers to arrays is when your array is ragged — that is, imagedata[a][b] and imagedata[c][d] have different lengths — or maybe for swapping rows around, such as swap(imagedata[a][b], imagedata[c][d]). And under these circumstances, vector as you've used it is preferable to use until proven otherwise.

@Ben: Oh, I know what happened; when skimming through I saw your use of vectors and mostly ignored the content, mentally filling in what I thought a vector-like solution would be. :( (will delete shortly) — , Apr 12 '15 at 19:16
Nah don't delete. I like your arithmetic better. Just indicate that the answers are similar. — Ben Voigt, Apr 12 '15 at 19:18

WhozCraig · Accepted Answer · 2015-04-13T15:41:44.453

The primary portion of your algorithm that is killing performance is the granularity and sheer number of allocations you're making. In total you're producing 3001501 broken down as:

1 allocation for 1500 double**
1500 allocations, each of which obtains 2000 double*
3000000 allocations each of which obtains double[4]

This can be considerably reduced. You can certainly do as other suggest and simply allocate 1 massive array of double, leaving the index calculation to accessor functions. Of course, if you do that you need to ensure you bring the sizes along for the ride. The result, however, will easily deliver the fastest allocation time and access performance. Using a std::vector<double> arr(d1*d2*4); and doing the offset math as needed will serve very well.

Another Way

If you are dead set on using a pointer array approach, you can eliminate the 3000000 allocations by obtaining both of the inferior dimensions in single allocations. Your most-inferior dimension is fixed (4), thus you could do this: (but you'll see in a moment there is a much more C++-centric mechanism):

double (**allocPtrsN(size_t d1, size_t d2))[4]
{
    typedef double (*Row)[4];
    Row *res = new Row[d1];

    for (size_t i=0; i<d1; ++i)
        res[i] = new T[d2][4];

    return res;
}

and simply invoke as:

double (**arr3D)[4] = allocPtrsN(d1,d2);

where d1 and d2 are your two superior dimensions. This produces exactly d1 + 1 allocations, the first being d1 pointers, the remaining be d1 allocations, one for each double[d2][4].

Using C++ Standard Containers

The prior code is obviously tedious, and frankly prone to considerable error. C++ offers a tidy solution this using a vector of vector of fixed array, doing this:

std::vector<std::vector<std::array<double,4>>> arr(1500, std::vector<std::array<double,4>>(2000));

Ultimately this will do nearly the same allocation technique as the rather obtuse code shown earlier, but provide you all the lovely benefits of the standard library while doing it. You get all those handy members of the std::vector and std::array templates, and RAII features as an added bonus.

However, this is one significant difference. The raw pointer method shown earlier will not value-initialize each allocated entity; the vector of vector of array method will. If you think it doesn't make a difference...

#include <iostream>
#include <vector>
#include <array>
#include <chrono>

using Quad = std::array<double, 4>;
using Table = std::vector<Quad>;
using Cube = std::vector<Table>;

Cube allocCube(size_t d1, size_t d2)
{
    return Cube(d1, Table(d2));
}

double ***allocPtrs(size_t d1, size_t d2)
{
    double*** ptrs = new double**[d1];
    for (size_t i = 0; i < d1; i++)
    {
        ptrs[i] = new double*[d2];
        for (size_t j = 0; j < d2; j++)
        {
            ptrs[i][j] = new double[4];
        }
    }
    return ptrs;
}

void freePtrs(double***& ptrs, size_t d1, size_t d2)
{
    for (size_t i=0; i<d1; ++i)
    {
        for (size_t j=0; j<d2; ++j)
            delete [] ptrs[i][j];
        delete [] ptrs[i];
    }
    delete [] ptrs;
    ptrs = nullptr;
}

double (**allocPtrsN(size_t d1, size_t d2))[4]
{
    typedef double (*Row)[4];
    Row *res = new Row[d1];

    for (size_t i=0; i<d1; ++i)
        res[i] = new double[d2][4];

    return res;
}

void freePtrsN(double (**p)[4], size_t d1, size_t d2)
{
    for (size_t i=0; i<d1; ++i)
        delete [] p[i];
    delete [] p;
}

std::vector<std::vector<std::array<double,4>>> arr(1500, std::vector<std::array<double,4>>(2000));

template<class C>
void print_duration(const std::chrono::time_point<C>& beg,
                    const std::chrono::time_point<C>& end)
{
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - beg).count() << "ms\n";
}

int main()
{
    using namespace std::chrono;
    time_point<system_clock> tp;
    volatile double vd;

    static constexpr size_t d1 = 1500, d2 = 2000;

    tp = system_clock::now();
    for (int i=0; i<10; ++i)
    {
        double ***cube = allocPtrs(d1,d2);
        cube[d1/2][d2/21][1] = 1.0;
        vd = cube[d1/2][d2/2][3];
        freePtrs(cube, 1500, 2000);
    }
    print_duration(tp, system_clock::now());

    tp = system_clock::now();
    for (int i=0; i<10; ++i)
    {
        Cube cube = allocCube(1500,2000);
        cube[d1/2][d2/21][1] = 1.0;
        vd = cube[d1/2][d2/2][3];
    }
    print_duration(tp, system_clock::now());

    tp = system_clock::now();
    for (int i=0; i<10; ++i)
    {
        auto cube = allocPtrsN(d1,d2);
        cube[d1/2][d2/21][1] = 1.0;
        vd = cube[d1/2][d2/21][1];
        freePtrsN(cube, d1, d2);
    }
    print_duration(tp, system_clock::now());
}

Output

5328ms
418ms
95ms

Thusly, if you're planning on loading up every element with something besides zero anyway, it is something to keep in mind.

Conclusion

If performance were critical I would use the 24MB (on my implementation, anyway) single-allocation, likely in a std::vector<double> arr(d1*d2*4);, and do the offset calculations as needed using one form of secondary indexing or another. Other answers proffer up interesting ideas on this, notably Ben's, which radically reduces the allocation count two a mere three blocks (data, and two secondary pointer arrays). Sorry, I didn't have time to bench it, but I would suspect the performance would be stellar. But if you really want to keep your existing technique, consider doing it in a C++ container as shown above. If the extra cycles spent value initializing the world aren't too heavy a price to pay, it will be much easier to manage (and obviously less code to deal with in comparison to raw pointers).

Best of luck.

slow performance for 3D array delete C++

4 Answers4

Linked