
I've looked all around this site and others, and nothing has worked. I'm resorting to posting a question for my specific case.

I have a bunch of matrices, and the goal is to use a kernel to let the GPU do the same operation on all of them. I'm pretty sure I can get the kernel to work, but I can't get cudaMalloc / cudaMemcpy to work.

I have a pointer to a Matrix structure, which has a member called elements that points to some floats. I can do all the non-cuda mallocs just fine.

Thanks for any/all help.

Code:

typedef struct {
    int width;
    int height;
    float* elements;
} Matrix;

int main(void) {
    int rows, cols, numMat = 2; // These are actually determined at run-time
    Matrix* data = (Matrix*)malloc(numMat * sizeof(Matrix));

    // ... Successfully read from file into "data" ...

    Matrix* d_data;
    cudaMalloc(&d_data, numMat*sizeof(Matrix)); 
    for (int i=0; i<numMat; i++){
        // The next line doesn't work
        cudaMalloc(&(d_data[i].elements), rows*cols*sizeof(float));

        // Don't know if this works
        cudaMemcpy(d_data[i].elements, data[i].elements, rows*cols*sizeof(float), cudaMemcpyHostToDevice);
    }

    // ... Do other things ...
}

Thanks!

t_carn
  • It won't work this way. You have allocated `d_data` using `cudaMalloc`, and you are trying to access `d_data[i]` on the host, which is not possible. – sgarizvi Oct 16 '13 at 13:40
  • A better approach would be to allocate `d_data` on the host using `malloc` and then allocate `d_data.elements` on the device using `cudaMalloc`. It is not clear how you are using the allocated structure inside the device code. – sgarizvi Oct 16 '13 at 13:43
  • Thanks @sgar91. But where do you say I'm trying to access d_data[i]? – t_carn Oct 16 '13 at 13:43
  • Here in the first argument: `cudaMemcpy(d_data[i].elements, data[i].elements, rows*cols*sizeof(float), cudaMemcpyHostToDevice);`. You are trying to access a device pointer on the host. – sgarizvi Oct 16 '13 at 13:46
  • @sgar91 - In the device code, I just need to be able to operate on the elements of each matrix. I haven't seen an example of allocating structures on the host whose members point to device memory. Could you post a quick example? Thank you! – t_carn Oct 16 '13 at 13:47
  • [Here you go](http://pastebin.com/TcXk1L7h). – sgarizvi Oct 16 '13 at 14:03

1 Answer


You have to be aware of where your memory resides. malloc allocates host memory; cudaMalloc allocates memory on the device and returns a pointer to that memory. However, this pointer is only valid in device functions.
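
As a minimal illustration of the distinction (the variable names here are just for this sketch):

float* d_ptr;                              // will hold a device address
cudaMalloc(&d_ptr, 10 * sizeof(float));    // allocates memory on the GPU
// d_ptr[0] = 1.0f;                        // invalid: dereferencing a device pointer on the host
cudaMemset(d_ptr, 0, 10 * sizeof(float));  // fine: runtime API calls accept device pointers
cudaFree(d_ptr);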

What you want can be achieved as follows:

typedef struct {
    int width;
    int height;
    float* elements;
} Matrix;

int main(void) {
    int rows, cols, numMat = 2; // These are actually determined at run-time
    Matrix* data = (Matrix*)malloc(numMat * sizeof(Matrix));

    // ... Successfully read from file into "data" ...

    // host-side copy of the structs; its element pointers will be replaced by device pointers
    Matrix* h_data = (Matrix*)malloc(numMat * sizeof(Matrix));
    memcpy(h_data, data, numMat * sizeof(Matrix));

    for (int i = 0; i < numMat; i++) {
        cudaMalloc(&(h_data[i].elements), rows*cols*sizeof(float));
        cudaMemcpy(h_data[i].elements, data[i].elements, rows*cols*sizeof(float), cudaMemcpyHostToDevice);
    }

    // matrix data is now on the gpu, now copy the "meta" data to the gpu
    Matrix* d_data;
    cudaMalloc(&d_data, numMat*sizeof(Matrix));
    cudaMemcpy(d_data, h_data, numMat*sizeof(Matrix), cudaMemcpyHostToDevice);

    // ... Do other things ...
}

To make things clear: `Matrix* data` holds the data on the host. `Matrix* h_data` lives on the host, but its `elements` members point to device memory, so those pointers can be passed to kernels as parameters. `Matrix* d_data` is completely on the GPU and can be used inside kernels the same way `data` is used on the host.

In your kernel code you can now access the matrix values, e.g.:

__global__ void doThings(Matrix* matrices)
{
    int i = threadIdx.x;          // e.g. one matrix per thread
    matrices[i].elements[0] = 42;
}
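
For completeness, here is a short sketch of how that kernel might be launched and how the memory could be released afterwards (the launch configuration is only an assumption for illustration):

// launch one thread per matrix (illustrative configuration)
doThings<<<1, numMat>>>(d_data);
cudaDeviceSynchronize();

// clean up: free the device element buffers via the host-side copies, then the rest
for (int i = 0; i < numMat; i++)
    cudaFree(h_data[i].elements);
cudaFree(d_data);
free(h_data);
free(data);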
Michael Haidl
  • Thanks @kronos, I will try this. Is the idea of having the "intermediate pointer" `h_data` the standard way to do this? – t_carn Oct 16 '13 at 14:03
  • Well, that depends. It adds overhead in your host code, because you have to store the other struct data twice, which can lead to bugs. You can do two things: pack the device pointers into an array and pass that array to a kernel (the array must be allocated in device memory as well), or add a field to your struct that holds the device pointer. With the second option you can use the same structure on the host and the device, accessing host data via `elements` and device data via, let's say, `d_elements` (see the sketch below). – Michael Haidl Oct 16 '13 at 14:09
  • This is very helpful. It makes me wonder if I need to put anything entirely onto the device at all - if I can just call the kernel with host pointers to device memory. Thanks again! – t_carn Oct 16 '13 at 14:40
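
A minimal sketch of the second option mentioned above (the `d_elements` field is an assumed addition to the struct, named here only for illustration):

typedef struct {
    int width;
    int height;
    float* elements;    // points to host memory
    float* d_elements;  // points to the device copy of the same data
} Matrix;

// for each matrix: allocate the device buffer and copy the host data into it
cudaMalloc(&(data[i].d_elements), rows * cols * sizeof(float));
cudaMemcpy(data[i].d_elements, data[i].elements,
           rows * cols * sizeof(float), cudaMemcpyHostToDevice);

The kernel can then be given `data[i].d_elements` directly, or the whole `data` array can be copied to the device as in the answer.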