
I recently ran into a problem when copying dynamically allocated device data to host memory. The data is allocated on the device runtime heap with new inside a kernel, and I try to copy it from device to host in a host function. Here is the code:

    #include <cuda.h>
    #include <stdio.h>

    #define N 100

    __device__ int* d_array;

    __global__ void allocDeviceMemory()
    {
        d_array = new int[N];
        for (int i = 0; i < N; i++)
            d_array[i] = 123;
    }

    int main()
    {
        allocDeviceMemory<<<1, 1>>>();
        cudaDeviceSynchronize();

        int* d_a = NULL;
        cudaMemcpyFromSymbol((void**)&d_a, "d_array", sizeof(d_a), 0, cudaMemcpyDeviceToHost);
        printf("gpu address: %p\n", d_a);

        int* h_array = (int*)malloc(N * sizeof(int));
        cudaError_t err = cudaMemcpy(h_array, d_a, N * sizeof(int), cudaMemcpyDeviceToHost);
        printf("h_array: %d, %d\n", h_array[0], err);

        getchar();
        return 0;
    }

Another poster had the same issue with CUDA 4.1, and some experts suggested that upgrading the CUDA driver and runtime to a newer version would solve it: CUDA - Copy device data to host?

I have CUDA toolkit 4.2, the latest developer drivers, and a C2075, but the same problem still comes up. Please let me know how to solve it.

li7hui

1 Answer


Unfortunately, there is no way to do what you are trying to do in CUDA 4. The host API cannot copy from dynamically allocated addresses on the device runtime heap; only device code can access them. If you want to copy with the host API, you will need to have device code write the data into an "output" buffer allocated with the host API first, and then you are free to use cudaMemcpy to retrieve it from the host.
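A minimal sketch of that workaround, reusing the names from the question's code (the extra copyToOutputBuffer kernel is my own illustration, not something from the original post): the data still lives on the device runtime heap, but a second kernel copies it into a cudaMalloc'ed buffer that the host API is allowed to touch.

```cuda
#include <cuda.h>
#include <stdio.h>

#define N 100

__device__ int* d_array;

__global__ void allocDeviceMemory()
{
    d_array = new int[N];      // allocated on the device runtime heap
    for (int i = 0; i < N; i++)
        d_array[i] = 123;
}

// Device code CAN read the runtime-heap allocation, so copy it
// into a buffer the host API knows about.
__global__ void copyToOutputBuffer(int* out)
{
    for (int i = 0; i < N; i++)
        out[i] = d_array[i];
    delete[] d_array;          // done with the heap allocation
}

int main()
{
    int* d_out = NULL;
    cudaMalloc((void**)&d_out, N * sizeof(int));  // host-API allocation

    allocDeviceMemory<<<1, 1>>>();
    copyToOutputBuffer<<<1, 1>>>(d_out);
    cudaDeviceSynchronize();

    // This cudaMemcpy is legal because d_out came from cudaMalloc,
    // not from in-kernel new/malloc.
    int h_array[N];
    cudaError_t err = cudaMemcpy(h_array, d_out, N * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    printf("h_array[0]: %d, err: %d\n", h_array[0], (int)err);

    cudaFree(d_out);
    return 0;
}
```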

You can see confirmation of this limitation from Mark Harris of Nvidia here.


Since this answer was posted in 2012, the restriction on host API interoperability appears to have been set in stone, and is explicitly documented in the CUDA programming guide.

talonmies
  • Thank you Talonmies. I also saw a similar solution that said you can allocate a buffer in host memory first, pass that pointer to the device kernel, and use memcpy within the kernel to copy data from the device into the host buffer. However, this does not solve my problem. Will CUDA 5 provide this kind of support? – li7hui Jun 26 '12 at 10:04
  • I have not yet tried the CUDA 5 beta to see whether it works there. But keep in mind that you are free to use device-code allocations for as long as you like, even across many kernel launches. It is only the "final copy" back to the host via the API that must use memory allocated by the host API. – talonmies Jun 26 '12 at 17:35
  • Thank you Talonmies for answering the question. – li7hui Jun 26 '12 at 21:08