A multi-dimensional array definition in C++ falls into one of two categories.
The first category is when the compiler knows or can discover the array width at compile time. In this case, for a multiply-subscripted access such as `a[x][y]`, the compiler knows the width of `a` (i.e. the number of elements in the dimension corresponding to the last subscript), and "under the hood" will generate indexing like this:

`*(a + x*awidth + y)`

where all those items mean exactly what they would mean in C++: `a` is the "name" of the array, which decays to a pointer when doing this type of pointer arithmetic, `x` is the first subscript, `y` is the second subscript, and `awidth` is the width that the compiler has discovered at compile time for `a`. According to C++ definitions, `a` is an array. In this case, when `a` is allowed to "decay" to a pointer, it points to a location that contains not a pointer, but an element of the same type as the array `a`.
In this category, only a single access to memory is generated, and it retrieves the element of `a` in question. I would call this the "efficient" case, although that is subjective and depends on matters not fully defined or discussed here, so I don't wish to argue the point.
Another type of multiply-subscripted array can be constructed using "pointer chasing". In this case, `a` is not an array; it is a double-pointer:

`T **a;`

In this case, `a` points to a location that contains a pointer to the element type of the array (more typically, `a` points to the first element of an array of "row-pointers"); therefore I refer to `a` as a "double-pointer", and formally, `a` does not name an array. However, in typical usage, we can reference an element of `a` with the same syntax: `a[x][y]`. The compiler "under the hood" does not generate the pointer arithmetic that we previously covered; instead it generates two successive pointer dereference operations:
- dereference `a+x` (i.e. perform `*(a+x)`) to retrieve another pointer; let's call that retrieved pointer `q` (effectively one of the "row-pointers" of `a`)
- dereference `q+y` (i.e. perform `*(q+y)`) to retrieve the desired element of `a`
In the above we see that two memory operations are required. The first retrieves a pointer, the second retrieves the element in question.
I would call this the "inefficient" method. The first method formally uses an "array"; this second method does not, but in typical doubly-subscripted usage they appear syntactically identical.
All of this discussion applies to C++ compiler behavior, and is not unique or specific to CUDA.
If you are careful to make it so that the CUDA compiler can discover your array width at compile time (not always possible), then it will behave the same, in terms of pointer arithmetic, pointer dereferencing, and number of memory operations involved, as what I have described for the first case above. Otherwise, if you declare your "array" using the second method, you will get multiple memory operations per access. (The compiler may also cache the result of the first dereference operation, which may provide some benefit, but this is not guaranteed behavior in all cases, and it only helps when there is some reuse, not in the general case.)
This answer provides a survey of 2D and 3D examples in CUDA, along with some discussion of tradeoffs.
For people who are using, or must use, the second type, flattening the array is always possible, and the flattened alternative results in an access methodology that behaves roughly like the first method above: only one memory access is required, using ordinary pointer arithmetic.
I'll only briefly mention performance. We could compare the two cases by saying that the second case involves two memory accesses (that are not necessarily "near" each other) and the first case involves a single memory access plus possibly some arithmetic.
GPUs generally have arithmetic-to-memory access ratios that are much greater than 1. As a result, for each memory access, a GPU has the capacity to do many arithmetic operations. Therefore, the first case, in this simplistic comparison, may be "more efficient" on a GPU, as it uses GPU resources in a somewhat more balanced fashion. GPUs often have "compute to spare" but many algorithms are memory-bound. In the memory bound case, cutting your required memory accesses by as much as half (while perhaps increasing the integer compute load) could result in substantial speed-up.