
I'd like to clean up the parameters of the CUDA kernels in my project.

Currently, a kernel needs 3 uint32_t arrays, which leads to pretty ugly code (id is the global thread id and valX is some arbitrary value):

__global__ void some_kernel(uint32_t *arr1, uint32_t *arr2, uint32_t *arr3)
{
    arr1[id] = val1;
    arr2[id] = val2;
    arr3[id] = val3;
}

I'd like to wrap all those arrays in a struct:

typedef struct S {
    uint32_t *arr1;
    uint32_t *arr2;
    uint32_t *arr3;
    uint32_t size;
} S;

where size denotes the length of every arrX inside the struct.

What I would like to have is something like:

__global__ void some_kernel(S *s)
{
    s->arr1[id] = val1;
    s->arr2[id] = val2;
    s->arr3[id] = val3;
}

What would the corresponding cudaMalloc and cudaMemcpy calls look like for a struct like this? Are there any performance drawbacks from this that I'm not seeing yet?

Thanks in advance!

  • Have you tried `cudaMemcpy(dst, src, number_of_Ss * sizeof(S), cudaMemcpyHostToDevice)` with `dst` and `src` being `S*`s and `number_of_Ss` being the number of `S`s that you want to copy? – triple_r Jul 23 '15 at 21:13
  • Yes, and I got this: test.cu(27): error: no suitable conversion function from "S" to "const void *" exists ...right at the memcopy – Daniel Jünger Jul 23 '15 at 21:19
  • Why don't you just pass the structure by value? No cudaMalloc or cudaMemcpy required. – talonmies Jul 23 '15 at 21:34
  • @talonmies I'm confused. Could you explain this to me? Let's say size=10^10 . Also, let's say I need to memcopy because I initialized each array entry on the host-side. How do I pass a struct this big by-value? – Daniel Jünger Jul 23 '15 at 21:47
  • Make a structure containing pointers. cudaMalloc each pointer. Pass the structure by value. If you don't understand that, then I think you need to revise pointers, references and values in C++. CUDA is conceptually a bit complicated, but you need to understand C or C++ thoroughly before trying to write CUDA code. Your first C++ program shouldn't also be your first CUDA program. – talonmies Jul 23 '15 at 21:52
  • The error says it cannot convert from `S` to `const void *`; you need to pass a pointer to cudaMemcpy, not the actual variable. So `S *src, *dst; src = (S *)malloc(n * sizeof(S)); cudaMalloc((void **)&dst, n * sizeof(S)); cudaMemcpy(dst, src, n * sizeof(S), cudaMemcpyHostToDevice);` – triple_r Jul 23 '15 at 22:03
  • Thanks, I needed that clarification. I'm not a newbie to C++ - nor to CUDA. I'll put the cudaMallocs inside the struct constructor (one for host and one for device) and then pass by reference. That should do the trick. ;) – Daniel Jünger Jul 23 '15 at 22:04
  • I don't recommend putting device memory allocation and deallocation in your constructor and destructor unless you are truly careful about managing scope. It can lead to some very hard to diagnose runtime errors. – talonmies Jul 24 '15 at 06:05

1 Answer


You have at least two options. One excellent choice was already given by talonmies, but I'll introduce you to the "learn the hard way" approach.
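
For reference, here is a minimal, self-contained sketch of that first option: cudaMalloc each array, fill the struct on the host, and pass the whole struct to the kernel by value (the kernel name, the main driver and the 256-thread launch configuration below are just illustration, not code from the question):

#include <cstdint>

struct S {
    uint32_t *arr1;
    uint32_t *arr2;
    uint32_t *arr3;
    uint32_t size;
};

__global__ void some_kernel_byval(S s)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < (int) s.size)
    {
        s.arr1[id] = 1; // val1
        s.arr2[id] = 2; // val2
        s.arr3[id] = 3; // val3
    }
}

int main()
{
    const uint32_t n = 10000;

    S s;
    s.size = n;
    // Only the arrays live in device memory; the struct itself is a plain
    // value that travels as a kernel argument, so no cudaMalloc/cudaMemcpy
    // is needed for the struct.
    cudaMalloc((void **) &s.arr1, n * sizeof(uint32_t));
    cudaMalloc((void **) &s.arr2, n * sizeof(uint32_t));
    cudaMalloc((void **) &s.arr3, n * sizeof(uint32_t));

    some_kernel_byval<<<(n + 255) / 256, 256>>>(s);
    cudaDeviceSynchronize();

    cudaFree(s.arr1);
    cudaFree(s.arr2);
    cudaFree(s.arr3);
    return 0;
}

Now for the harder route.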

First, your struct definition:

typedef struct S {
    uint32_t *arr1;
    uint32_t *arr2;
    uint32_t *arr3; 
    uint32_t size;
} S;

...and the kernel definition (with a global variable for the size, but you don't need to follow that pattern):

const int size = 10000;

__global__ void some_kernel(S *s)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < size)
    {
        s->arr1[id] = 1; // val1
        s->arr2[id] = 2; // val2
        s->arr3[id] = 3; // val3
    }
}

Notice that the if statement protects you from running out of bounds.

Next, we need a function that prepares the data, executes the kernel and prints the result. Part one is data allocation:

uint32_t *host_arr1, *host_arr2, *host_arr3;
uint32_t *dev_arr1, *dev_arr2, *dev_arr3;

// Allocate and fill host data
host_arr1 = new uint32_t[size]();
host_arr2 = new uint32_t[size]();
host_arr3 = new uint32_t[size]();

// Allocate device data   
cudaMalloc((void **) &dev_arr1, size * sizeof(*dev_arr1));
cudaMalloc((void **) &dev_arr2, size * sizeof(*dev_arr2));
cudaMalloc((void **) &dev_arr3, size * sizeof(*dev_arr3));

// Allocate helper struct on the device
S *dev_s;
cudaMalloc((void **) &dev_s, sizeof(*dev_s));

It's nothing special: you just allocate three arrays and the struct. What's more interesting is how to copy such data to the device:

// Copy data from host to device
cudaMemcpy(dev_arr1, host_arr1, size * sizeof(*dev_arr1), cudaMemcpyHostToDevice);
cudaMemcpy(dev_arr2, host_arr2, size * sizeof(*dev_arr2), cudaMemcpyHostToDevice);
cudaMemcpy(dev_arr3, host_arr3, size * sizeof(*dev_arr3), cudaMemcpyHostToDevice);

// NOTE: Binding pointers with dev_s
cudaMemcpy(&(dev_s->arr1), &dev_arr1, sizeof(dev_s->arr1), cudaMemcpyHostToDevice);
cudaMemcpy(&(dev_s->arr2), &dev_arr2, sizeof(dev_s->arr2), cudaMemcpyHostToDevice);
cudaMemcpy(&(dev_s->arr3), &dev_arr3, sizeof(dev_s->arr3), cudaMemcpyHostToDevice);

Besides the ordinary copies of the arrays, notice that it's also necessary to "bind" them to the struct. For that you need to pass the address of each pointer; as a result, only the pointers themselves are copied, not the memory they point to.

Next comes the kernel call, copying the data back to the host and printing the results:

// Call kernel
some_kernel<<<(size + 255) / 256, 256>>>(dev_s); // enough 256-thread blocks to cover all size elements

// Copy result to host:
cudaMemcpy(host_arr1, dev_arr1, size * sizeof(*host_arr1), cudaMemcpyDeviceToHost);
cudaMemcpy(host_arr2, dev_arr2, size * sizeof(*host_arr2), cudaMemcpyDeviceToHost);
cudaMemcpy(host_arr3, dev_arr3, size * sizeof(*host_arr3), cudaMemcpyDeviceToHost);

// Print some result
std::cout << host_arr1[size-1] << std::endl;
std::cout << host_arr2[size-1] << std::endl;
std::cout << host_arr3[size-1] << std::endl;

Keep in mind that in any serious code you should always check for errors from CUDA API calls.
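
One common way to do that is a small checking macro wrapped around every runtime call, for example (a sketch; the macro name is arbitrary):

#include <cstdio>
#include <cstdlib>

// Wrap every CUDA runtime call; print the error string and abort on failure.
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error '%s' at %s:%d\n",          \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Usage, e.g.:
// CUDA_CHECK(cudaMalloc((void **) &dev_arr1, size * sizeof(*dev_arr1)));
// CUDA_CHECK(cudaMemcpy(dev_arr1, host_arr1, size * sizeof(*dev_arr1), cudaMemcpyHostToDevice));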

– Grzegorz Szpetkowski
  • If you build the device structure in host memory first and then copy it to `dev_s`, you could replace the three memory copies in the `// NOTE: Binding pointers with dev_s` section with a single memcpy. That's going to be a lot simpler and faster. – talonmies Jul 24 '15 at 06:43
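
In code, that suggestion might look roughly like this, reusing the names from the answer above (host_s is a hypothetical helper variable):

// Build the struct in host memory first...
S host_s;
host_s.arr1 = dev_arr1; // device pointers, stored in a host-side copy of the struct
host_s.arr2 = dev_arr2;
host_s.arr3 = dev_arr3;
host_s.size = size;

// ...then a single copy replaces the three pointer-binding memcpys
cudaMemcpy(dev_s, &host_s, sizeof(S), cudaMemcpyHostToDevice);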