JCuda: copy multidimensional array from device to host

Question

I've been working with JCuda for some months now and I can't copy a multidimensional array from device memory to host memory. The funny thing is that I have no problems in doing so in the opposite direction (I can invoke my kernel with multidimensional arrays and everything works with the correct values).

In short, my kernel writes its results to a two-dimensional array of shorts whose first dimension is the number of threads, so that each thread writes to its own location.

Here is an example:

CUdeviceptr pointer_dev = new CUdeviceptr();
cuMemAlloc(pointer_dev, Sizeof.POINTER); // in this case, as an example, it's an array with one element (one thread), but it doesn't matter

// Invoke kernel with pointer_dev as parameter. Now it should contain some results

CUdeviceptr[] arrayPtr = new CUdeviceptr[1]; // It will point to the result
arrayPtr[0] = new CUdeviceptr();
short[] resultArray = new short[3]; // an array of 3 shorts was allocated in the kernel

cuMemAlloc(arrayPtr[0], 3 * Sizeof.SHORT);
cuMemcpyDtoH(Pointer.to(arrayPtr), pointer_dev, Sizeof.POINTER); // It seems, using the debugger, that the value of arrayPtr[0] isn't changed here!
cuMemcpyDtoH(Pointer.to(resultArray), arrayPtr[0], 3 * Sizeof.SHORT); // Not the expected values in resultArray, probably because of the previous instruction

What am I doing wrong?

EDIT:

Apparently, there are limitations that prevent device-allocated memory from being copied back to the host, as stated in this thread (and many others): link

Any workaround? I'm using CUDA Toolkit v5.0

There are at least 2 possible issues. One is the one you stated: memory allocated using device functions like `malloc` or `new` cannot be directly copied to the host. There are also various challenges associated with copying dynamically allocated data that contains pointers to other dynamically allocated data, and there are plenty of questions about that on SO as well. Unfortunately, I'm not familiar enough with Java or JCuda syntax to tell you exactly how to fix this. - Robert Crovella, Aug 29 '13 at 15:00
A possible workaround for the question in your edit is simply to copy the data from device-allocated regions to host-allocated regions in your device code, before copying it back to the host. - Robert Crovella, Aug 29 '13 at 15:00
Thanks, that is what I did, too. But a problem persists: the size of the host-allocated regions is fixed before the kernel computation, and I have no way to predetermine what the dimension of the output will be. For now I've decided to use a fixed size big enough to work in most cases, but I don't think it's a "good" solution. - Rorrim, Sep 03 '13 at 09:12

score 4 · Accepted Answer · edited May 15 '18 at 19:44

The following copies a two-dimensional array of integers from the device to the host.

First, create a host-side array of blockSizeX device pointers, one per row, and allocate device memory for size integers behind each of them.

CUdeviceptr[] hostDevicePointers = new CUdeviceptr[blockSizeX];
for (int i = 0; i < blockSizeX; i++)
{
    hostDevicePointers[i] = new CUdeviceptr();
    cuMemAlloc(hostDevicePointers[i], size * Sizeof.INT);
}

Next, allocate device memory for the array of row pointers itself, and copy the row pointers from the host to the device.

CUdeviceptr hostDevicePointersArray = new CUdeviceptr();
cuMemAlloc(hostDevicePointersArray, blockSizeX * Sizeof.POINTER);
cuMemcpyHtoD(hostDevicePointersArray, Pointer.to(hostDevicePointers), blockSizeX * Sizeof.POINTER);

Launch the kernel.

kernelLauncher.call(........, hostDevicePointersArray);

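Transfer the output from the device to the host. Each row lives in its own device allocation, so copy the rows back one at a time through the pointers kept in hostDevicePointers.

int sum = 0;
for (int i = 0; i < blockSizeX; i++)
{
    // Each row was allocated with size * Sizeof.INT bytes, so copy exactly that much.
    int[] hostOutputData = new int[size];
    cuMemcpyDtoH(Pointer.to(hostOutputData), hostDevicePointers[i], size * Sizeof.INT);

    for (int j = 0; j < size; j++)
    {
        sum = sum + hostOutputData[j];
    }
}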
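If the rows are needed together on the host rather than only summed, one option is to gather them into a single flat, row-major array. The sketch below is plain Java with no JCuda calls, so it runs without a GPU. The deviceRows parameter is a hypothetical stand-in for the per-row buffers that cuMemcpyDtoH would fill, and the class and method names are illustrative, not part of JCuda.

```java
import java.util.Arrays;

// Illustrative only: simulates the host-side layout after reading rows back.
class FlatReadback {
    // Copies each row into a flat row-major array,
    // so element (i, j) ends up at index i * size + j.
    static int[] flatten(int[][] deviceRows, int size) {
        int[] flat = new int[deviceRows.length * size];
        for (int i = 0; i < deviceRows.length; i++) {
            // In real code this copy would be the per-row device-to-host transfer.
            System.arraycopy(deviceRows[i], 0, flat, i * size, size);
        }
        return flat;
    }

    // Sums every element of the flat array.
    static int sum(int[] data) {
        int total = 0;
        for (int value : data) {
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        int[][] rows = { {1, 2, 3}, {4, 5, 6} }; // two rows of size 3
        int[] flat = flatten(rows, 3);
        System.out.println(Arrays.toString(flat)); // [1, 2, 3, 4, 5, 6]
        System.out.println(sum(flat));             // 21
    }
}
```

With this layout, row i occupies indices i * size through (i + 1) * size - 1, which matches a host array sized as the row count times the row length.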
