The primary portion of your algorithm that is killing performance is the granularity and sheer number of allocations you're making. In total you're producing 3001501 broken down as:
- 1 allocation for 1500
double**
- 1500 allocations, each of which obtains 2000
double*
- 3000000 allocations each of which obtains
double[4]
This can be considerably reduced. You can certainly do as other suggest and simply allocate 1 massive array of double, leaving the index calculation to accessor functions. Of course, if you do that you need to ensure you bring the sizes along for the ride. The result, however, will easily deliver the fastest allocation time and access performance. Using a std::vector<double> arr(d1*d2*4);
and doing the offset math as needed will serve very well.
Another Way
If you are dead set on using a pointer array approach, you can eliminate the 3000000 allocations by obtaining both of the inferior dimensions in single allocations. Your most-inferior dimension is fixed (4), thus you could do this: (but you'll see in a moment there is a much more C++-centric mechanism):
double (**allocPtrsN(size_t d1, size_t d2))[4]
{
typedef double (*Row)[4];
Row *res = new Row[d1];
for (size_t i=0; i<d1; ++i)
res[i] = new T[d2][4];
return res;
}
and simply invoke as:
double (**arr3D)[4] = allocPtrsN(d1,d2);
where d1
and d2
are your two superior dimensions. This produces exactly d1 + 1
allocations, the first being d1
pointers, the remaining be d1
allocations, one for each double[d2][4]
.
Using C++ Standard Containers
The prior code is obviously tedious, and frankly prone to considerable error. C++ offers a tidy solution this using a vector of vector of fixed array, doing this:
std::vector<std::vector<std::array<double,4>>> arr(1500, std::vector<std::array<double,4>>(2000));
Ultimately this will do nearly the same allocation technique as the rather obtuse code shown earlier, but provide you all the lovely benefits of the standard library while doing it. You get all those handy members of the std::vector
and std::array
templates, and RAII features as an added bonus.
However, this is one significant difference. The raw pointer method shown earlier will not value-initialize each allocated entity; the vector of vector of array method will. If you think it doesn't make a difference...
#include <iostream>
#include <vector>
#include <array>
#include <chrono>
using Quad = std::array<double, 4>;
using Table = std::vector<Quad>;
using Cube = std::vector<Table>;
Cube allocCube(size_t d1, size_t d2)
{
return Cube(d1, Table(d2));
}
double ***allocPtrs(size_t d1, size_t d2)
{
double*** ptrs = new double**[d1];
for (size_t i = 0; i < d1; i++)
{
ptrs[i] = new double*[d2];
for (size_t j = 0; j < d2; j++)
{
ptrs[i][j] = new double[4];
}
}
return ptrs;
}
void freePtrs(double***& ptrs, size_t d1, size_t d2)
{
for (size_t i=0; i<d1; ++i)
{
for (size_t j=0; j<d2; ++j)
delete [] ptrs[i][j];
delete [] ptrs[i];
}
delete [] ptrs;
ptrs = nullptr;
}
double (**allocPtrsN(size_t d1, size_t d2))[4]
{
typedef double (*Row)[4];
Row *res = new Row[d1];
for (size_t i=0; i<d1; ++i)
res[i] = new double[d2][4];
return res;
}
void freePtrsN(double (**p)[4], size_t d1, size_t d2)
{
for (size_t i=0; i<d1; ++i)
delete [] p[i];
delete [] p;
}
std::vector<std::vector<std::array<double,4>>> arr(1500, std::vector<std::array<double,4>>(2000));
template<class C>
void print_duration(const std::chrono::time_point<C>& beg,
const std::chrono::time_point<C>& end)
{
std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - beg).count() << "ms\n";
}
int main()
{
using namespace std::chrono;
time_point<system_clock> tp;
volatile double vd;
static constexpr size_t d1 = 1500, d2 = 2000;
tp = system_clock::now();
for (int i=0; i<10; ++i)
{
double ***cube = allocPtrs(d1,d2);
cube[d1/2][d2/21][1] = 1.0;
vd = cube[d1/2][d2/2][3];
freePtrs(cube, 1500, 2000);
}
print_duration(tp, system_clock::now());
tp = system_clock::now();
for (int i=0; i<10; ++i)
{
Cube cube = allocCube(1500,2000);
cube[d1/2][d2/21][1] = 1.0;
vd = cube[d1/2][d2/2][3];
}
print_duration(tp, system_clock::now());
tp = system_clock::now();
for (int i=0; i<10; ++i)
{
auto cube = allocPtrsN(d1,d2);
cube[d1/2][d2/21][1] = 1.0;
vd = cube[d1/2][d2/21][1];
freePtrsN(cube, d1, d2);
}
print_duration(tp, system_clock::now());
}
Output
5328ms
418ms
95ms
Thusly, if you're planning on loading up every element with something besides zero anyway, it is something to keep in mind.
Conclusion
If performance were critical I would use the 24MB (on my implementation, anyway) single-allocation, likely in a std::vector<double> arr(d1*d2*4);
, and do the offset calculations as needed using one form of secondary indexing or another. Other answers proffer up interesting ideas on this, notably Ben's, which radically reduces the allocation count two a mere three blocks (data, and two secondary pointer arrays). Sorry, I didn't have time to bench it, but I would suspect the performance would be stellar. But if you really want to keep your existing technique, consider doing it in a C++ container as shown above. If the extra cycles spent value initializing the world aren't too heavy a price to pay, it will be much easier to manage (and obviously less code to deal with in comparison to raw pointers).
Best of luck.