The CUDA documentation recommends using cudaMemcpy2D() for 2D arrays (and similarly cudaMemcpy3D() for 3D arrays) instead of cudaMemcpy(), since it pairs with pitched device allocations (cudaMallocPitch()) that lay out the rows with proper alignment, which gives better performance. On the other hand, all the cudaMemcpy functions, just like memcpy(), require the memory to be contiguously allocated.
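To make the pairing concrete, here is a minimal sketch of how cudaMallocPitch() and cudaMemcpy2D() are typically used together; the names devPtr, pitch and hostArray (and the assumption that hostArray is one contiguous block of h*w floats) are mine for illustration, not part of this question:

float* devPtr = nullptr;
size_t pitch = 0;                              // row stride in bytes, chosen by the runtime
cudaMallocPitch(&devPtr, &pitch, w * sizeof(float), h);

cudaMemcpy2D(devPtr, pitch,                    // destination and its pitch
             hostArray, w * sizeof(float),     // source and its pitch
             w * sizeof(float), h,             // width in bytes, height in rows
             cudaMemcpyHostToDevice);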

This is all fine if I create my (host) array as, for example, float myArray[h][w];. However, it most likely will not work if I use something like:

float** myArray2 = new float*[h];
for( int i = 0 ; i < h ; i++ ){
   myArray2[i] = new float[w];   // each row is a separate allocation, so the rows are not contiguous
}

This is not a big problem, except when one is trying to integrate CUDA into an existing project, which is the situation I am facing. Right now I create a temporary 1D array, copy the contents of my 2D array into it, call cudaMemcpy(), and then repeat the whole process in reverse to get the results back after the kernel launch, but this does not seem like an elegant or efficient way to do it.
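Roughly, the workaround I use now looks like the following sketch (tmp and d_data are illustrative names rather than my actual code, and d_data is assumed to be a device buffer of h*w floats allocated with cudaMalloc()):

float* tmp = new float[h * w];                 // temporary contiguous copy of the 2D data
for( int i = 0 ; i < h ; i++ ){
   memcpy(&tmp[i * w], myArray2[i], w * sizeof(float));    // flatten row by row
}

cudaMemcpy(d_data, tmp, h * w * sizeof(float), cudaMemcpyHostToDevice);
// ... kernel launch ...
cudaMemcpy(tmp, d_data, h * w * sizeof(float), cudaMemcpyDeviceToHost);

for( int i = 0 ; i < h ; i++ ){
   memcpy(myArray2[i], &tmp[i * w], w * sizeof(float));    // copy the results back into the 2D array
}
delete[] tmp;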

Is there a better way to handle this situation? Specifically, is there a way to create a genuine 2D array on the heap with contiguously allocated rows, so that I can use cudaMemcpy2D()?

P.S.: I couldn't find the answer to this question in the following similar posts:

S.G.
  • It's not clear to me why your second link is not the solution. – Anon Mail Nov 03 '15 at 17:10
  • @AnonMail, I could be wrong, but in that question a container is defined (similar to std::vector). Internally it uses a 1D array to achieve contiguous allocation. Also, similar to std::vector and std::map, one cannot directly access elements of the container using pointers and should use iterators instead. I doubt one can copy the contents of such objects using memcpy(). – S.G. Nov 03 '15 at 17:29
  • @RobertCrovella, thanks for the comment. That (flattening the 2D array manually prior to the HostToDevice copy) is exactly what I am doing right now in my application. I was hoping to find a way to do it differently so I can take advantage of the more efficient memory transfers of cudaMemcpy2D(). Looks like this is the only way. – S.G. Nov 03 '15 at 17:36
  • I was assuming you did not want to modify the host (allocation) code *at all*. But that is not correct. Actually you are willing to modify the host code, but you want to preserve doubly-subscripted access on the host. In that case, the answer by @DaleWilson is a great suggestion. – Robert Crovella Nov 03 '15 at 17:38

1 Answer

Allocate the big array, then use pointer arithmetic to find the actual beginnings of the rows.

float* bigArray = new float[h * w];   // one contiguous block holding all h*w elements
float** myArray2 = new float*[h];     // row pointers into that block
for( int i = 0 ; i < h ; i++ ){
   myArray2[i] = &bigArray[i * w];
}

Your myArray2 array of pointers gives you C/C++-style two-dimensional array behavior on the host, while bigArray gives you the single contiguous block of memory that CUDA needs.
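For illustration only (d_data, devPitched and pitch are assumed names not present in the original answer), the host code can keep using double subscripts while the copies work on the contiguous block:

float* d_data = nullptr;
cudaMalloc(&d_data, h * w * sizeof(float));

myArray2[0][0] = 1.0f;                         // ordinary two-dimensional access on the host

// Copy the whole contiguous block in one call...
cudaMemcpy(d_data, bigArray, h * w * sizeof(float), cudaMemcpyHostToDevice);

// ...or, with a pitched device allocation, use cudaMemcpy2D():
// cudaMemcpy2D(devPitched, pitch, bigArray, w * sizeof(float),
//              w * sizeof(float), h, cudaMemcpyHostToDevice);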

Dale Wilson
  • Thanks, @Dale Wilson. Just to clarify, this way I can pass myArray2 to cudaMemcpy2D(), right? – S.G. Nov 03 '15 at 17:31
  • Note that I edited my post to make the first line new float[h*w] rather than new float*[h*w]. Sorry for the typo. Now about your question: you should probably use bigArray to pass the array to CUDA, but you could also use myArray2[0]. In any case you need a pointer to the contiguous array of floats, not a pointer to the pointer to the array, which is what passing myArray2 would give you. – Dale Wilson Nov 03 '15 at 17:36
  • Great! That's exactly what I was hoping for. Thank you. – S.G. Nov 03 '15 at 17:40
  • @RobertCrovella, I do not. Could you please elaborate a bit more as to why I should not use myArray2[0]? – S.G. Nov 03 '15 at 17:47
  • You can use `myArray2[0]`. In my now-deleted comment, I said you should use `bigArray` not `myArray2`. But you can also use `myArray2[0]`, since by inspection that is identical to `bigArray`. – Robert Crovella Nov 03 '15 at 17:50