
How would one efficiently transfer a (kind of) multidimensional array, defined similarly to

int********* A;

in C, to a CUDA GPU? Or how would one convert such a multidimensional array to a single-dimensional one for the transfer? Thanks!

starter
  • Sorry, but this is a nonsense question. Nobody would *ever* construct a multidimensional array in this way, and you clearly are not doing so. The expertise at your disposal here at Stack Overflow is a precious and finite resource; please think carefully before squandering it by asking frivolous, meaningless questions like this one. – talonmies Oct 23 '12 at 20:44
  • @talonmies I would not be asking the question if I had not come across such a situation. – starter Oct 24 '12 at 18:22
  • @talonmies Just a suggestion for the future: never use the word "nobody" to prove something... Best. – starter Oct 24 '12 at 18:38
  • So you are seriously suggesting you have a multidimensional array allocated with *8 nested levels of malloc calls* and accessed by value with *8 pointer indirections*? Perhaps I should have used the expression "nobody with even the faintest idea of what they are doing", rather than "nobody". For that I apologise. But if you created and allocated such an array, surely it must be self-evident how to flatten it? – talonmies Oct 25 '12 at 18:15

1 Answer


Since you've edited your question, I'll edit my response. Such an array (int********* A) is rather difficult to create: it requires nested loops of malloc calls, where the nesting level equals the array dimensionality. Having said that, the response is similar to what I have already posted below. Either you have a parallel set of nested loops doing the cudaMalloc and cudaMemcpy along the way, or else you linearize the whole thing and transfer it in one step. For a two-dimensional array, I could possibly consider suggesting either approach. For an N-dimensional array, the first method is simply madness, as illustrated in this sequence of SO questions. Therefore, I think you should certainly linearize a high-dimensional, varying-row array before trying to transfer it to the device. The method of linearization is covered in the previous question you refer to and is outside the scope of my answer here. Once linearized, the transfer operation is straightforward and can be done with a single cudaMalloc/cudaMemcpy operation.
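
As a minimal sketch of that final step (assuming the data has already been flattened into a contiguous host buffer h_A of N int elements; the names and sizes here are illustrative, and error checking is omitted):

    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(void)
    {
        const size_t N = 1024 * 1024;               /* total element count after flattening */
        int *h_A = (int *)malloc(N * sizeof(int));  /* the already-linearized host copy */
        int *d_A = NULL;

        /* one allocation and one transfer cover the entire data set */
        cudaMalloc((void **)&d_A, N * sizeof(int));
        cudaMemcpy(d_A, h_A, N * sizeof(int), cudaMemcpyHostToDevice);

        /* ... launch kernels that index d_A linearly ... */

        cudaFree(d_A);
        free(h_A);
        return 0;
    }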


Presumably you are referring to arrays where the individual rows have different sizes (and are therefore malloc'ed independently). I think you have 2 choices:

  1. Transfer the rows independently, with a corresponding cudaMalloc for each host malloc and a cudaMemcpy for each cudaMalloc.
  2. Combine (pack) the rows in host memory, so as to create one contiguous block that is the size of the overall data set (the sum of the row sizes). Then, using a single cudaMemcpy, transfer this "packed" array to the device in one step. From a transfer standpoint, this will be most efficient; see the sketch after this list.
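
A sketch of the packing in option 2 (assuming rows independently malloc'ed host rows A[i], whose element counts are stored in len[i]; all names here are illustrative, and error checking is omitted):

    #include <cuda_runtime.h>
    #include <stdlib.h>
    #include <string.h>

    /* pack 'rows' independently malloc'ed rows (A[i], len[i] elements each)
       into one contiguous buffer, then transfer it in a single step */
    int *pack_and_copy(int **A, const size_t *len, int rows, size_t **offsets_out)
    {
        size_t total = 0, pos = 0;
        for (int i = 0; i < rows; i++) total += len[i];   /* overall data set size */

        int *h_packed = (int *)malloc(total * sizeof(int));
        size_t *offsets = (size_t *)malloc(rows * sizeof(size_t));

        for (int i = 0; i < rows; i++) {
            offsets[i] = pos;                     /* element index where row i begins */
            memcpy(h_packed + pos, A[i], len[i] * sizeof(int));
            pos += len[i];
        }

        int *d_packed = NULL;
        cudaMalloc((void **)&d_packed, total * sizeof(int));
        cudaMemcpy(d_packed, h_packed, total * sizeof(int), cudaMemcpyHostToDevice);

        free(h_packed);
        *offsets_out = offsets;   /* caller keeps the offsets for row indexing */
        return d_packed;
    }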

In either case, you will have to carefully consider the access mechanism that makes the array conveniently available on the GPU. The first method may be easier in this respect, since you will automatically have pointers for each row. For the second method, you may need to create a set of pointers on the device to match your row pointers on the host. Beyond that, your access mechanism on the device should be similar to the one on the host, since either will use a set of row pointers to access your array.
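
For the second method, creating the device-side row pointers might look like this (a sketch that reuses the hypothetical d_packed and offsets[] from the previous snippet):

    #include <cuda_runtime.h>
    #include <stdlib.h>

    /* build a device-resident table of row pointers that mirrors the host's */
    int **make_device_row_pointers(int *d_packed, const size_t *offsets, int rows)
    {
        /* compute each row's device address on the host ... */
        int **h_rowptr = (int **)malloc(rows * sizeof(int *));
        for (int i = 0; i < rows; i++)
            h_rowptr[i] = d_packed + offsets[i];

        /* ... then copy the pointer table itself to the device */
        int **d_rowptr = NULL;
        cudaMalloc((void **)&d_rowptr, rows * sizeof(int *));
        cudaMemcpy(d_rowptr, h_rowptr, rows * sizeof(int *), cudaMemcpyHostToDevice);

        free(h_rowptr);
        return d_rowptr;   /* a kernel can now use d_rowptr[i][j], just as on the host */
    }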

If instead you are referring to an ordinary multidimensional array (a[dim1][dim2][dim3]...), the transfer is straightforward, since the data is already contiguous in memory and accessible with a single pointer. If you remake the original varying-rows array as an ordinary multidimensional array whose number of columns equals the longest row (leaving some elements unused in the shorter rows), you can take advantage of this technique instead. This has some inefficiency because you transfer unused elements, but accessing the array becomes straightforward.
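
A sketch of that padded approach (assuming maxlen is the length of the longest row; the unused tail elements of shorter rows are transferred but never read):

    #include <cuda_runtime.h>
    #include <stdlib.h>
    #include <string.h>

    /* copy varying-length rows into a rectangular rows x maxlen host array,
       then transfer the whole rectangle in one step */
    int *copy_padded(int **A, const size_t *len, int rows, size_t maxlen)
    {
        int *h_pad = (int *)calloc((size_t)rows * maxlen, sizeof(int));
        for (int i = 0; i < rows; i++)
            memcpy(h_pad + (size_t)i * maxlen, A[i], len[i] * sizeof(int));

        int *d_pad = NULL;
        cudaMalloc((void **)&d_pad, (size_t)rows * maxlen * sizeof(int));
        cudaMemcpy(d_pad, h_pad, (size_t)rows * maxlen * sizeof(int),
                   cudaMemcpyHostToDevice);

        free(h_pad);
        return d_pad;   /* element (i, j) is d_pad[i * maxlen + j] */
    }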

If you have truly sparse matrices, you might also want to consider sparse matrix representation methods. CUSP would be one library for handling and manipulating these on the GPU.
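
For instance, a small CSR matrix can be built on the host and moved to the GPU with a cross-memory-space copy construction (a sketch based on CUSP's documented csr_matrix interface; the matrix contents are illustrative):

    #include <cusp/csr_matrix.h>

    int main(void)
    {
        /* a 3x3 sparse matrix with 4 nonzeros, in CSR format on the host */
        cusp::csr_matrix<int, float, cusp::host_memory> A(3, 3, 4);

        A.row_offsets[0] = 0; A.row_offsets[1] = 1;   /* row i spans entries        */
        A.row_offsets[2] = 2; A.row_offsets[3] = 4;   /* [row_offsets[i], ...[i+1]) */
        A.column_indices[0] = 0; A.values[0] = 10.0f;
        A.column_indices[1] = 1; A.values[1] = 20.0f;
        A.column_indices[2] = 0; A.values[2] = 30.0f;
        A.column_indices[3] = 2; A.values[3] = 40.0f;

        /* copying to the GPU is just a copy construction across memory spaces */
        cusp::csr_matrix<int, float, cusp::device_memory> d_A(A);

        return 0;
    }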

This answer may also be of interest.

Robert Crovella