Is this way of allocating a device object "correct"?

Question

So I asked a question before about how to allocate an object on the device directly, instead of the "normal" approach:

Allocate on host
Copy to device
Copy dynamically allocated fields to device one by one
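
For reference, the three-step pattern above might look like the sketch below. It is only an illustration: `HostParticle` and `show` are hypothetical names, error checking is omitted, and the struct has a single dynamically allocated field so the pointer patching in step 3 is easy to see.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical plain struct with one dynamically allocated field.
struct HostParticle
{
    int *data;
};

__global__ void show(const HostParticle *p)
{
    for (int i = 0; i < 10; i++)
        printf("%d\n", p->data[i]);
}

int main()
{
    // 1. Allocate (and fill) on host
    HostParticle h;
    h.data = new int[10];
    for (int i = 0; i < 10; i++)
        h.data[i] = i * 2;

    // 2. Copy the struct itself to the device
    HostParticle *d_p;
    cudaMalloc((void**)&d_p, sizeof(HostParticle));
    cudaMemcpy(d_p, &h, sizeof(HostParticle), cudaMemcpyHostToDevice);

    // 3. Copy the dynamically allocated field, then overwrite the
    //    device copy's pointer so it refers to device memory
    int *d_data;
    cudaMalloc((void**)&d_data, 10 * sizeof(int));
    cudaMemcpy(d_data, h.data, 10 * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(&(d_p->data), &d_data, sizeof(int*), cudaMemcpyHostToDevice);

    show<<<1,1>>>(d_p);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    cudaFree(d_p);
    delete[] h.data;
    return 0;
}
```

Step 3 has to be repeated for every pointer field, which is the manual work described next.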

The main reason I want it to be allocated directly on the device is that I don't want to copy each dynamically allocated field inside one by one manually.

Anyway, so I think I have actually found a way to do this, and I would like to see some input from more experienced CUDA programmers (like Robert Crovella).

Let's see the code first:

#include <cstdio>

class Particle
{
    public:
    int *data;

    __device__ Particle()
    {
        data = new int[10];
        for (int i=0; i<10; i++)
        {
            data[i] = i*2;
        }
    }
};


__global__ void test(Particle **result)
{
    Particle *p = new Particle();

    result[0] = p; // store memory location
}

__global__ void test2(Particle *p)
{
    for (int i=0; i<10; i++)
        printf("%d\n", p->data[i]);

}

int main() {
    // initialise and allocate an object on device
    Particle **d_p_addr;
    cudaMalloc((void**)&d_p_addr, sizeof(Particle*));
    test<<<1,1>>>(d_p_addr);

    // copy pointer to host memory
    Particle  **p_addr = new Particle*[1];
    cudaMemcpy(p_addr, d_p_addr, sizeof(Particle*), cudaMemcpyDeviceToHost);

    // test:
    test2<<<1,1>>>(p_addr[0]);

    cudaDeviceSynchronize();

    printf("Done!\n");

}

As you can see, what I do is:

Call a kernel that initialises an object on the device and stores its pointer in an output parameter
Copy the pointer to the allocated object from device memory to host memory
Now you can pass that pointer to another kernel just fine!

This code actually works, but I'm not sure if there are drawbacks.

Cheers

EDIT: as pointed out by Robert, there was no point in creating a pointer on the host first, so I removed that part from the code.

Robert Crovella · Answer 1 · 2013-04-19T21:04:05.917

Yes, you can do that.

You are allocating an object on the device, and passing a pointer to it from one kernel to the next. Since a characteristic of device malloc/new is that allocations persist for the lifetime of the context (not just the kernel), the allocations do not disappear at the end of the kernel. This is basically standard C++ behavior, but I thought it might be worth repeating. The pointer(s) that you are passing from one kernel to the next are therefore valid in any subsequent device code in the context of your program.

There is a wrinkle you might want to be aware of, however. Pointers returned by dynamic allocations done on the device (such as via new or malloc in device code) are not usable for transferring data from device to host, at least in the present incarnation of CUDA (CUDA 5.0 and earlier). The reasons for this are somewhat arcane (translation: I can't adequately explain it) but it's instructive to think about the fact that dynamic allocations come out of the device heap, a region that is logically separate from the region of global memory that runtime API functions like cudaMalloc and cudaMemcpy use. An oblique indication of this is given here:

Memory reserved for the device heap is in addition to memory allocated through host-side CUDA API calls such as cudaMalloc().
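
Because the device heap is a separate reservation, its size can be inspected and changed from the host with cudaDeviceGetLimit and cudaDeviceSetLimit. The following is a minimal sketch; the 64 MB figure is an arbitrary value chosen for illustration, and as I understand the documentation the limit has to be set before launching any kernel that uses device-side new or malloc.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Query the current size of the device heap used by in-kernel new/malloc
    size_t heap_bytes = 0;
    cudaDeviceGetLimit(&heap_bytes, cudaLimitMallocHeapSize);
    printf("device heap before: %zu bytes\n", heap_bytes);

    // Request a larger heap (64 MB is an arbitrary illustrative value)
    cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize,
                                         64 * 1024 * 1024);
    if (err != cudaSuccess)
        printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));

    // Read the limit back to see what the runtime actually granted
    cudaDeviceGetLimit(&heap_bytes, cudaLimitMallocHeapSize);
    printf("device heap after:  %zu bytes\n", heap_bytes);
    return 0;
}
```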

If you want to prove this wrinkle to yourself, try adding the following seemingly innocuous code after your second kernel call:

Particle *q;
q = (Particle *)malloc(sizeof(Particle));
cudaMemcpy(q, p_addr[0], sizeof(Particle), cudaMemcpyDeviceToHost);

If you then check the API error value returned from that cudaMemcpy operation, you will observe the error.
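
A self-contained sketch of that check is below, reusing the Particle class from the question. It also shows one possible way around the restriction, which is my own addition rather than something required by the code above: a kernel (hypothetically named `copy_out`) copies the data from the device heap into a buffer obtained with cudaMalloc, and that buffer is then transferred with cudaMemcpy as usual.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

class Particle
{
    public:
    int *data;

    __device__ Particle()
    {
        data = new int[10];
        for (int i = 0; i < 10; i++)
            data[i] = i * 2;
    }
};

__global__ void test(Particle **result)
{
    result[0] = new Particle();   // allocated on the device heap
}

// Hypothetical helper: copy heap data into a cudaMalloc'ed buffer
__global__ void copy_out(const Particle *p, int *out)
{
    for (int i = 0; i < 10; i++)
        out[i] = p->data[i];
}

int main()
{
    Particle **d_p_addr;
    cudaMalloc((void**)&d_p_addr, sizeof(Particle*));
    test<<<1,1>>>(d_p_addr);

    // Copying the pointer value itself works: d_p_addr came from cudaMalloc
    Particle *p = NULL;
    cudaMemcpy(&p, d_p_addr, sizeof(Particle*), cudaMemcpyDeviceToHost);

    // Copying *from* the device-heap pointer: inspect the returned status
    Particle *q = (Particle *)malloc(sizeof(Particle));
    cudaError_t err = cudaMemcpy(q, p, sizeof(Particle), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess)
        printf("cudaMemcpy from device heap failed: %s\n", cudaGetErrorString(err));
    else
        printf("cudaMemcpy from device heap succeeded\n");

    // Possible workaround: stage the data through a cudaMalloc'ed buffer
    int *d_buf;
    int h_buf[10];
    cudaMalloc((void**)&d_buf, 10 * sizeof(int));
    copy_out<<<1,1>>>(p, d_buf);
    err = cudaMemcpy(h_buf, d_buf, 10 * sizeof(int), cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        for (int i = 0; i < 10; i++)
            printf("%d\n", h_buf[i]);

    free(q);
    cudaFree(d_buf);
    cudaFree(d_p_addr);
    return 0;
}
```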

As an unrelated comment, your use of the pointer *p is a little freaky, in my book, and the compiler warning given about it is an indication of the weirdness. It's not technically illegal, since you're not actually doing anything meaningful with that pointer (you immediately replace it in your kernel 1) but nevertheless it's weird because you're passing a pointer to a kernel that you haven't properly cudaMalloc'ed. In the context of what you're demonstrating, it's completely unnecessary, and your first parameter to kernel 1 could be eliminated and replaced with a local variable, eliminating the weirdness and compiler warning.

So when does the memory allocated on the heap go away? when do I get out of context? — SpaceMonkey, Apr 19 '13 at 21:25
It goes away when the cuda context gets destroyed (which would normally be at application termination) or earlier if you explicitly call `free()` or `delete`. Again, I believe this mimics [standard C++ behavior](http://stackoverflow.com/questions/4570210/c-pointer-scope). — Robert Crovella, Apr 19 '13 at 21:29
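
A minimal sketch of releasing the allocation explicitly, assuming the Particle class from the question: the kernel names `create` and `cleanup` are hypothetical. Memory obtained with device-side new has to be released with device-side delete (not cudaFree), and because Particle has no destructor, the `data` array is deleted separately before the object itself.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

class Particle
{
    public:
    int *data;

    __device__ Particle()
    {
        data = new int[10];
        for (int i = 0; i < 10; i++)
            data[i] = i * 2;
    }
};

__global__ void create(Particle **result)
{
    result[0] = new Particle();   // lives on the device heap
}

// Hypothetical cleanup kernel: frees what the constructor and new allocated
__global__ void cleanup(Particle *p)
{
    delete[] p->data;   // Particle has no destructor, so free the array first
    delete p;
}

int main()
{
    Particle **d_p_addr;
    cudaMalloc((void**)&d_p_addr, sizeof(Particle*));
    create<<<1,1>>>(d_p_addr);

    // Fetch the device-heap pointer so it can be passed to later kernels
    Particle *p = NULL;
    cudaMemcpy(&p, d_p_addr, sizeof(Particle*), cudaMemcpyDeviceToHost);

    // ... other kernels could use p here ...

    cleanup<<<1,1>>>(p);
    cudaDeviceSynchronize();

    cudaFree(d_p_addr);   // this one came from cudaMalloc, so cudaFree applies
    printf("Done!\n");
    return 0;
}
```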
