A multi-dimensional array definition in C++ falls into one of two categories.
The first category is when the compiler knows or can discover the array width at compile time. In this case, for a multiply-subscripted access such as `a[x][y]`, the compiler knows the width of `a` (i.e. the number of elements in the dimension corresponding to the last subscript), and "under the hood" will generate indexing like this:

`*(a + x*awidth + y)`

where all those items mean exactly what they would mean in C++: `a` is the "name" of the array, which decays to a pointer when doing this type of pointer arithmetic, `x` is the first subscript, `y` is the second subscript, and `awidth` is the width that the compiler has discovered at compile time for `a`. According to C++ definitions, `a` is an array. In this case, when `a` is allowed to "decay" to a pointer, it points to a location that contains not a pointer, but an element of the same type as the array `a`.
In this category, only a single access to memory is generated, and it retrieves the element of `a` in question. I would call this the "efficient" case, although that is subjective and depends on matters not fully defined or discussed here, so I don't wish to argue the point.
Another type of multiply-subscripted array can be constructed using "pointer chasing". In this case, `a` is not an array; it is a double-pointer:

`T **a;`

In this case, `a` points to a location that contains a pointer to the element type of the array (more typically, `a` points to the first element of an array of "row-pointers"); therefore I refer to `a` as a "double-pointer", and formally, `a` does not name an array. However, in typical usage, we can reference an element of `a` with the same syntax: `a[x][y]`. The compiler "under the hood" does not generate the pointer arithmetic that we previously covered; instead it generates two successive pointer dereference operations:
- dereference `a+x` (i.e. perform `*(a+x)`) to retrieve another pointer; let's call that retrieved pointer `q` (effectively one of the "row-pointers" of `a`)
- dereference `q+y` (i.e. perform `*(q+y)`) to retrieve the desired element of `a`
In the above we see that two memory operations are required. The first retrieves a pointer, the second retrieves the element in question.
I would call this the "inefficient" method. The first method formally uses an "array"; this second method does not, but in typical doubly-subscripted usage they appear syntactically identical.
All of this discussion applies to C++ compiler behavior, and is not unique or specific to CUDA.
If you are careful to make it so that the CUDA compiler can discover your array width at compile time (not always possible), then it will behave the same, in terms of pointer arithmetic, pointer dereferencing, and number of memory operations involved, as what I have described for the first case above. Otherwise, if you declare your "array" using the second method, you will get multiple memory operations per access. (The compiler may also cache the result of the first dereference operation, which may provide some benefit, but this is not guaranteed behavior in all cases, and it only helps when there is some reuse, not in the general case.)
This answer provides a survey of 2D and 3D examples in CUDA, along with some discussion of tradeoffs.
For people who are using, or must use, the second type, flattening the array is always possible, and the flattened alternative results in an access methodology that behaves roughly like the first method above: only one memory access is required, using ordinary pointer arithmetic.
I'll only briefly mention performance. We could compare the two cases by saying that the second case involves two memory accesses (that are not necessarily "near" each other) and the first case involves a single memory access plus possibly some arithmetic.
GPUs generally have arithmetic-to-memory access ratios that are much greater than 1. As a result, for each memory access, a GPU has the capacity to do many arithmetic operations. Therefore, the first case, in this simplistic comparison, may be "more efficient" on a GPU, as it uses GPU resources in a somewhat more balanced fashion. GPUs often have "compute to spare" but many algorithms are memory-bound. In the memory bound case, cutting your required memory accesses by as much as half (while perhaps increasing the integer compute load) could result in substantial speed-up.