How to copy structured data allocated in device memory from device to host

Question

I'm new to GPU and CUDA programming. I am trying to copy structured data dynamically allocated on device from device to host. I modified a simple code from GPU programming guide. I don't get any error when compiling the code but the only thing I have that's problematic is the output is wrong i.e. '0'. Here's the code:

#include <stdlib.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

typedef struct Point
{
    int2 pt;
};

#define NUMOFBLOCKS 1
#define THREDSPERBLOCK  16

__device__ Point* pnt[NUMOFBLOCKS];
Point dataptr_h[NUMOFBLOCKS][THREDSPERBLOCK];

__global__ void allocmem() 
{   
    if (threadIdx.x == 0)       
        pnt[blockIdx.x] = (Point*)malloc(1*blockDim.x * sizeof(Point)); 
    __syncthreads(); 
} 

__global__ void usemem() 
{ 
    Point* ptr = pnt[blockIdx.x]; 
    if (ptr != NULL) 
    {       
        ptr[threadIdx.x].pt.x = threadIdx.x; 
        ptr[threadIdx.x].pt.y = threadIdx.x;
        printf("Ptr = %d\t", ptr[threadIdx.x].pt.x);
    }
}

__global__ void freemem() 
{ 
    Point* ptr = pnt[blockIdx.x]; 
    if (ptr != NULL) 
        printf("Block %d, Thread %d: final value = %d\n", blockIdx.x, threadIdx.x, ptr[threadIdx.x]); 
    if (threadIdx.x == 0) 
        free(ptr); 
}


int main()
{
    Point* d_pt[NUMOFBLOCKS];
    for (int i = 0 ; i < NUMOFBLOCKS; i++)
        cudaMalloc(&d_pt[i], sizeof(Point)*16);  

    // Allocate memory  
    allocmem<<< NUMOFBLOCKS, THREDSPERBLOCK >>>();  
    // Use memory 
    usemem<<< NUMOFBLOCKS, THREDSPERBLOCK >>>(); 
    cudaMemcpyFromSymbol(d_pt, pnt, sizeof(d_pt));
    cudaMemcpy(dataptr_h, d_pt, sizeof(dataptr_h), cudaMemcpyDeviceToHost);

    for (int j = 0 ; j < 1; j++)
        for (int i = 0 ; i < 16; i++)
        {
            printf("\nPtr_h(%d,%d)->X = %d\t", j, i, dataptr_h[j][i].pt.x);
            printf("Ptr_h(%d,%d)->Y = %d", j, i, dataptr_h[j][i].pt.y);
        }

    freemem<<< NUMOFBLOCKS, THREDSPERBLOCK >>>();
    cudaDeviceSynchronize();
    return 0;
}

The output of the code is:

Ptr_h(0,0)->X = 0       Ptr_h(0,0)->Y = 0
Ptr_h(0,1)->X = 0       Ptr_h(0,1)->Y = 0
Ptr_h(0,2)->X = 0       Ptr_h(0,2)->Y = 0
Ptr_h(0,3)->X = 0       Ptr_h(0,3)->Y = 0
Ptr_h(0,4)->X = 0       Ptr_h(0,4)->Y = 0
Ptr_h(0,5)->X = 0       Ptr_h(0,5)->Y = 0
Ptr_h(0,6)->X = 0       Ptr_h(0,6)->Y = 0
Ptr_h(0,7)->X = 0       Ptr_h(0,7)->Y = 0
Ptr_h(0,8)->X = 0       Ptr_h(0,8)->Y = 0
Ptr_h(0,9)->X = 0       Ptr_h(0,9)->Y = 0
Ptr_h(0,10)->X = 0      Ptr_h(0,10)->Y = 0
Ptr_h(0,11)->X = 0      Ptr_h(0,11)->Y = 0
Ptr_h(0,12)->X = 0      Ptr_h(0,12)->Y = 0
Ptr_h(0,13)->X = 0      Ptr_h(0,13)->Y = 0
Ptr_h(0,14)->X = 0      Ptr_h(0,14)->Y = 0
Ptr_h(0,15)->X = 0      Ptr_h(0,15)->Y = 0

What can I do to fix this?

You should do proper [cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) on all CUDA API calls and kernel calls. It will point out the lines of code you are having trouble with. Since your kernel printf statements don't show up, it's reasonably clear that your kernels are not executing properly. Running your code with `cuda-memcheck` will likely shed light on that. — Robert Crovella, Oct 22 '13 at 02:20
In fact, when I run your code, some of your kernel printf statements do show up. So if you're not seeing `Ptr = 0 Ptr = 1 ...` then you may have another issue (machine configuration) as well. But proper cuda error checking will help you discover if that is the case also. — Robert Crovella, Oct 22 '13 at 02:55
Thank you Robert for your comment. But printf works well in kernel and I can see Ptr = 0 Ptr = 1 ... The only problem is that this data is not passed / copied to host. I'm trying to follow your next answer... — MUB, Oct 22 '13 at 21:39

score 1 · Accepted Answer · answered Oct 22 '13 at 02:44

1

You cannot use a pointer created by a device malloc operation with the CUDA runtime API (i.e. cudaMemcpy)

So this line of code is problematic:

cudaMemcpy(dataptr_h, d_pt, sizeof(dataptr_h), cudaMemcpyDeviceToHost);

d_pt contains pointers picked up from pnt. And pnt had it's value(s) set by device malloc.

Instead you'll need to create areas properly allocated with cudaMalloc, then copy the data you want to those areas first (from one area on the device to another area), then copy to the host using cudaMemcpy.

Before I go farther to explain your next objection, let's be clear that the above is your intent (to use a pointer created in a device malloc operation as one of the targets of a cudaMemcpy). This is not legal.

"But I used cudaMalloc ??"

d_pt is an array of pointers that live in host memory. You took each one of those pointers and assigned it a value (a pointed-to location in device memory) using cudaMalloc.

Then this line of code:

cudaMemcpyFromSymbol(d_pt, pnt, sizeof(d_pt));

over-wrote all those pointers that you set up with pointers obtained from elsewhere in device memory, specifically pointers assigned by device malloc. While this is technically legal (that line of code does not throw an error) those pointers are useless on the host (for use with the runtime API, anyway).

answered Oct 22 '13 at 02:44

Robert Crovella

143,785
11
213
257

Thank you for your explanation. What is the alternate way, if we cannot use a pointer created by a device operation with the CUDA runtime API ()? What does do then? I mean what is the purpose of that function if it doesn't pass the pointer allocated in the device? – MUB Oct 23 '13 at 09:50
The alternate way is to allocate memory areas using `cudaMalloc` and use those instead of device-allocated areas. I already indicated this in my response. `cudaMemcpyFromSymbol` copies data from a `__device__` area/variable to the host. It serves a pupose similar to `cudaMemcpy` – Robert Crovella Oct 23 '13 at 17:27
I tried to allocate memory using `cuaMalloc` but could not get thru... Moreover, I read in the CUDA programming guide 4.2 that Memory allocated via `malloc()` can be copied using the runtime (i.e. by calling any of the copy memory functions). Please refer to the `section B.17.2 of CUDA C Programming Guide 4.2`. – MUB Oct 24 '13 at 22:18
So, I'm a bit confused now. My understanding (I could be wrong) is that mem copy should work using `cudaMemcpyFromSymbol` if we copy from device i.e. symbol `pnt` to host `dataptr_h`, as explained by you too that `cudaMemcpyFromSymbol` copies data from a `__device__` area to the host. But it's not working here. – MUB Oct 24 '13 at 22:26
So you've referenced the CUDA 4.2 programming guide. Are you using CUDA 4.2? – Robert Crovella Oct 25 '13 at 01:04
I've CUDA 4.1, 4.2 and 5.0, but the data copy works on none of those. Although I couldn't find this reference in CUDA 5 programming guide, yet it dosen't work on either of the versions. – MUB Oct 25 '13 at 03:27
My claim, as I've stated already in my answer, is that using a device pointer created with `malloc` does not work with the host APIs. The statement you found in the CUDA 4.2 documentation is not present in the current documentation, as far as I know. – Robert Crovella Oct 25 '13 at 13:03
Yes, you are right. I've done some experiments and found that it doesn't work...but that's weird... Anyway, thank you for your help. – MUB Oct 26 '13 at 00:55

How to copy structured data allocated in device memory from device to host

1 Answers1