
I've spent two days trying to figure this out and have gotten nowhere. Say I have a struct that looks like this:

struct Thing {
    bool is_solid;
    double matrix[9];
};

I want to create an array of that struct called things and then process that array on the GPU. Something like:

Thing *things;
int num_of_things = 100;
cudaMallocManaged((void **)&things, num_of_things * sizeof(Thing));

// Something missing here? Malloc individual structs? Nothing I try works.

things[10].is_solid = true; // Segfaults

Is it even best practice to do it this way, rather than pass a single struct with arrays that are num_of_things large? It seems to me that that can get pretty nasty, especially when you have arrays already (like matrix, which would need to be 9 * num_of_things elements).

Any info would be much appreciated!

Yousef Amar
  • Your code works fine for me. [Here is my full test case](http://pastebin.com/tBXqFF1C) (you should provide something like this). There is nothing wrong with your approach; it would be typical for how you would do it in a managed scenario. Most likely you are making one of several errors: 1. You are not compiling for a cc3.0+ device. 2. You are not running on a cc3.0+ device. 3. Your environment [does not support managed memory usage](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements). – Robert Crovella Dec 18 '15 at 04:18
  • In any event, the best suggestion is to employ [proper cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) any time you are having trouble with a CUDA code, which I have demonstrated in my test case. Try running the code in my test case and see what kind of output you get; I think it will be instructive as to what is going on. – Robert Crovella Dec 18 '15 at 04:18
  • Thanks a lot! It turned out what I was doing wrong was that I was passing a third argument `0` to `cudaMallocManaged` since the compiler was saying not enough arguments. I didn't think it was relevant so I left it out of the question. It needed to be `cudaMemAttachGlobal` or `cudaMemAttachHost` as one very kind individual pointed out on freenode/#cuda. – Yousef Amar Dec 18 '15 at 04:37
  • Are you using CUDA 7.5? Although the documentation is not clear in this respect, the compiler will/should accept the `cudaMallocManaged` call with just 2 arguments (as you indicated in this question, and as I indicated in my example); in that case the `flags` variable will default to `cudaMemAttachGlobal`. If you are saying "the compiler was saying not enough arguments", I can only assume this was Visual Studio IntelliSense? (not actually the compiler) Anyway, you can convince yourself of the default argument for the `flags` parameter by inspecting `cuda_runtime.h` in the CUDA include directory. – Robert Crovella Dec 18 '15 at 05:09
  • Thanks, I checked and it does default to that in `cuda_runtime.h`, but it still won't compile with just 2 args. Not sure why. My setup is pure command line: g++, CMake's FindCUDA, and nvcc with flags `-gencode arch=compute_30,code=sm_30`. The fellow on IRC mentioned that it might be a C thing, but I think it's not that, since the file is .cu and everything else is C++. – Yousef Amar Dec 18 '15 at 05:46
  • What happens if you compile the code I provided in my test case (first comment) using the command line I provided there (not using CMake)? Do you still get a compile error? – Robert Crovella Dec 18 '15 at 06:09
  • That gives me "Success!" too (though I had to run nvcc with sudo to fix a fatbinary fatal error). Very strange indeed. I don't really have any idea why it doesn't accept 2 args given that your test case works. – Yousef Amar Dec 18 '15 at 06:39

1 Answer


After some dialog in the comments, it seems that OP's posted code has no issues. I was able to successfully compile and run this test case built around that code, and so was OP:

$ cat t1005.cu
#include <iostream>

struct Thing {
    bool is_solid;
    double matrix[9];
};

int main(){

  Thing *things;
  int num_of_things = 100;
  cudaError_t ret = cudaMallocManaged((void **)&things, num_of_things * sizeof(Thing));
  if (ret != cudaSuccess) {
    std::cout << cudaGetErrorString(ret) << std::endl;
    return 1;}
  else {
    things[10].is_solid = true;
    std::cout << "Success!" << std::endl;
    return 0;}
}
$ nvcc -arch=sm_30 -o t1005 t1005.cu
$ ./t1005
Success!
$
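
Since the goal stated in the question is to process the array on the GPU, here is a minimal kernel sketch to extend the test case; the kernel name toggle_solid and the launch configuration are illustrative, not part of the original code:

__global__ void toggle_solid(Thing *t, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
  if (i < n)
    t[i].is_solid = !t[i].is_solid;              // one thread per struct
}

// host side, after a successful cudaMallocManaged:
toggle_solid<<<(num_of_things + 255) / 256, 256>>>(things, num_of_things);
cudaDeviceSynchronize(); // required before the host touches managed data again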

Regarding this question:

Is it even best practice to do it this way, rather than pass a single struct with arrays that are num_of_things large?

Yes, this is a sensible practice and is usable whether managed memory is being used or not. An array of more or less any structure that does not contain embedded pointers to dynamically allocated data elsewhere can be transferred to the GPU in a simple fashion using a single cudaMemcpy call (for example, if managed memory were not being used).
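
For illustration, a minimal sketch of that non-managed path (the h_things/d_things names are placeholders, not from the question): because Thing contains no embedded pointers, the whole array moves in a single copy:

Thing *h_things = new Thing[num_of_things]; // ordinary host allocation
Thing *d_things;
cudaMalloc((void **)&d_things, num_of_things * sizeof(Thing));
// one cudaMemcpy call transfers the entire array of structs:
cudaMemcpy(d_things, h_things, num_of_things * sizeof(Thing), cudaMemcpyHostToDevice);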

To address the question about the 3rd (flags) parameter to cudaMallocManaged:

  1. If it is specified, it is not correct to pass zero (although OP's posted code gives no evidence of that). You should use one of the documented choices.
  2. If it is not specified, this is still valid, and a default argument of cudaMemAttachGlobal is provided. This can be confirmed by reviewing the cuda_runtime.h file or else simply compiling/running the test code above. This particular point appears to be an oversight in the documentation, and I've filed an internal issue at NVIDIA to take a look at that. So it's possible the documentation may change in the future with respect to this.
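
To illustrate, both accepted call forms are shown below (a sketch using the allocation from the question):

cudaMallocManaged((void **)&things, num_of_things * sizeof(Thing));                      // flags defaults to cudaMemAttachGlobal
cudaMallocManaged((void **)&things, num_of_things * sizeof(Thing), cudaMemAttachGlobal); // explicit flags
// passing 0 as the third argument is not one of the documented choices; expect the call to fail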

Finally, proper CUDA error checking is always in order any time you are having trouble with a CUDA code, and using it may shed some light on any errors that are made. The seg fault that the OP reported in the code comments was almost certainly due to the cudaMallocManaged call failing (perhaps because a zero flags parameter was supplied incorrectly), with the result that the pointer in question (things) had no actual allocation. Subsequent use of that pointer would lead to a seg fault. My test code demonstrates how to avoid that seg fault even if the cudaMallocManaged call fails for some reason, and the key is proper error checking.
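
For reference, one common form of such an error-checking macro, adapted from the question linked in the comments (a sketch; invoke it after each runtime API call and kernel launch):

#include <cstdio>
#include <cstdlib>

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            exit(1); \
        } \
    } while (0)

// usage, e.g. immediately after the allocation:
// cudaMallocManaged((void **)&things, num_of_things * sizeof(Thing));
// cudaCheckErrors("cudaMallocManaged fail");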

Robert Crovella