
I wonder if anyone could advise on storage of large (say 2000 x 2000 x 2000) 3D arrays for finite-difference discretization computations. Does contiguous storage as a flat float* give better performance than float*** on modern CPU architectures? Here is a simplified example of the computations, which are done over the entire arrays:

for i ...
    for j ...
        for k ...
            u[i][j][k] += v[i][j][k+1] + v[i][j][k-1]
                + v[i][j+1][k] + v[i][j-1][k] + v[i+1][j][k] + v[i-1][j][k];

Vs

    u[i * iStride + j * jStride + k] += ...

PS: Considering the size of the problems, the extra storage for T*** is a very small overhead. Access is not random. Moreover, I do loop blocking to minimize cache misses. I am just wondering how triple dereferencing in the T*** case compares to index computation plus a single dereference in the case of a 1D array.

user2052436

4 Answers


This is not an apples-to-apples comparison: a flat array is just that - a flat array, which your code partitions into segments according to some logic of linearizing a rectangular 3D array. You access an element of a flat array with a single dereference, plus a handful of math operations.

float***, on the other hand, lets you keep a "jagged" array of arrays of arrays, so the structure that you can represent inside such an array is a lot more flexible. Naturally, you pay for that flexibility with additional CPU cycles: to reach an element you dereference a pointer to pointer to pointer, then a pointer to pointer, and finally a pointer (the three pairs of square brackets in the code).

Naturally, access to the individual elements of float*** is going to be a little slower if you access them in truly random order. However, if the order is not random, the difference you see may be tiny, because the intermediate pointer values will be cached.

float*** will also require more memory, because you need to allocate two additional levels of pointers.
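As a sketch (the helper names here are my own, not from the question), one way to build both layouts over the same contiguous block, which makes the two extra pointer tables explicit:

```cpp
#include <vector>
#include <cstddef>

// Flat layout: one contiguous block; element (i,j,k) lives at i*N*N + j*N + k.
std::vector<float> make_flat(std::size_t N) {
    return std::vector<float>(N * N * N, 0.0f);
}

// float*** layout over the same contiguous data: two extra levels of
// pointer tables (N plane pointers, N*N row pointers). Only indexing differs.
float*** make_triple(std::size_t N, std::vector<float>& data,
                     std::vector<float*>& rows, std::vector<float**>& planes) {
    data.assign(N * N * N, 0.0f);
    rows.resize(N * N);
    planes.resize(N);
    for (std::size_t i = 0; i < N; ++i) {
        planes[i] = &rows[i * N];               // plane i -> its row table
        for (std::size_t j = 0; j < N; ++j)
            rows[i * N + j] = &data[(i * N + j) * N];  // row (i,j) -> its data
    }
    return planes.data();
}
```

The pointer tables cost N*sizeof(float**) + N*N*sizeof(float*) bytes, which is indeed negligible next to N^3 floats for the sizes the asker mentions.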

Sergey Kalinichenko
  • Considering size of problems, storing `float***` is a very small overhead. Access is not random. Moreover, I do loop blocking to minimize cache misses. I am just wondering how triple dereferencing in `T***` case compares to index computation and single dereferencing in case of 1D array – user2052436 May 08 '14 at 13:00
  • @user2052436 The most expensive part of dereference is reading the pointer. Other than that, it's only an addition. As far as cache access optimization goes, I think you are in a much better position to do it when you use a flat array representation. – Sergey Kalinichenko May 08 '14 at 13:07
  • @user2052436: Is the size of the array static? If so, compile-time will always be fastest, though perhaps not significantly. Cache issues are mostly the same, assuming that you've made efforts to use the same layout. The issue here is going to be aliasing. In other words, will the compiler need to reload the pointer(s) on every loop iteration? – kec May 08 '14 at 13:08

The short answer is: benchmark it. If the results are inconclusive, it means it doesn't matter. Do what makes your code most readable.

As @dasblinkenlight has pointed out, the structures are not equivalent, because T*** can be jagged.

At the most fundamental level, though, this comes down to the arithmetic and memory access operations.

For your 1D array, as you have already (almost) written, the calculation is:

 ptr = u + (i * iStride) + (j * jStride) + k
 read *ptr

With T***:

 ptr = u + i
 x = read ptr
 ptr = x + j
 y = read ptr
 ptr = y + k
 read ptr

So you are trading two multiplications for two memory accesses.
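Expressed as C++ (a sketch; iStride and jStride as in the question), the two access patterns are:

```cpp
#include <cstddef>

// Flat layout: two multiply-adds, then a single load.
inline float get_flat(const float* u, std::size_t iStride, std::size_t jStride,
                      std::size_t i, std::size_t j, std::size_t k) {
    return u[i * iStride + j * jStride + k];
}

// Triple-pointer layout: no multiplies, but three dependent loads.
inline float get_triple(float*** u, std::size_t i, std::size_t j, std::size_t k) {
    float** plane = u[i];     // load 1: plane pointer
    float*  row   = plane[j]; // load 2: row pointer
    return row[k];            // load 3: the element itself
}
```

Note that the three loads in get_triple are serially dependent, but in a k-innermost loop u[i] and u[i][j] are loop-invariant, so the compiler can usually hoist them (and they will sit in cache in any case).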

slim

In computer go, where people are very performance-sensitive, everyone (AFAIK) uses T[361] rather than T[19][19] (*). This decision is based on benchmarking, both in isolation and the whole program. (It is possible everyone did those benchmarks years and years ago, and have never done them again on the latest hardware, but my hunch is that a single 1-D array will still be better.)

However, your array is huge in comparison. As the code involved in each case is easy to write, I would definitely try it both ways and benchmark.


*: Aside: I think it is actually T[21][21] vs. T[441] in most programs, as an extra row all around is added to speed up board-edge detection.
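As an illustration of that padding trick (my own sketch, not from any particular go program): with one guard ring, an off-board neighbor is detected by a sentinel cell value instead of four bounds checks, and every neighbor of a playable point is a valid index.

```cpp
#include <array>

constexpr int W = 21;  // 19 playable columns plus one guard column on each side
enum Cell : char { EMPTY, BLACK, WHITE, EDGE };

using Board = std::array<Cell, W * W>;  // flat T[441] instead of T[21][21]

// Playable points use 1-based (row, col); row/col 0 and 20 are the guard ring.
inline int idx(int row, int col) { return row * W + col; }

Board make_board() {
    Board b{};  // zero-initialized, i.e. all EMPTY
    for (int i = 0; i < W; ++i) {
        b[idx(0, i)] = b[idx(W - 1, i)] = EDGE;  // top and bottom guard rows
        b[idx(i, 0)] = b[idx(i, W - 1)] = EDGE;  // left and right guard columns
    }
    return b;
}
```

The four neighbors of point p are then simply p - 1, p + 1, p - W, and p + W, with no branching on coordinates.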

Darren Cook

One issue that has not been mentioned yet is aliasing.

Does your compiler support some type of keyword like restrict to indicate that you have no aliasing? (It's not part of C++11 so would have to be an extension.) If so, performance may be very close to the same. If not, there could be significant differences in some cases. The issue will be with something like:

for (int i = ...) {
    for (int j = ...) {
        a[j] = b[i];
    }
}

Can b[i] be loaded once per outer loop iteration and stored in a register for the entire inner loop? In the general case, only if the arrays don't overlap. How does the compiler know? It needs some type of restrict keyword.
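With GCC and Clang the extension is spelled __restrict__ (MSVC uses __restrict); here is a sketch of the example above with the no-aliasing promise made explicit:

```cpp
#include <cstddef>

// Without the restrict qualifiers, the compiler must assume a and b may
// overlap, so b[i] would have to be reloaded after every store to a[j].
// With them, b[i] can be hoisted into a register for the whole inner loop.
void copy_rows(float* __restrict__ a, const float* __restrict__ b,
               std::size_t n, std::size_t m) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < m; ++j)
            a[j] = b[i];
}
```

The caller, of course, is now responsible for ensuring the arrays really do not overlap; passing overlapping pointers is undefined behavior.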

kec