
I am programming in C++/CUDA and have faced a problem:

```cpp
__global__ void KERNEL(int max_size, double* x, double* y, double* z)
{
      // Illegal: max_size is not a compile-time constant
      double localArray_x[max_size];
      double localArray_y[max_size];
      double localArray_z[max_size];
      //do stuff here
}
```

Right now my only solution is to predefine max_size:

```cpp
#define max_size 20
```

These arrays are the main focus of my kernel's work. Basically, I have global coordinates, and only segments of those coordinates (based on location within the simulation box) are added to the three local arrays. Work is then done on those coordinates, and at the end of the simulation they are written back to the global arrays (x, y, z). Because of this, the arrays have certain constraints:

  1. Each thread launched needs max_size*3 array elements to manipulate.
  2. Those arrays are used extensively, so the kernel needs fast (local) access to them.
  3. max_size can't be a compile-time constant, since the number density of my coordinates varies with the input to the host.

I know there are versions of this post across Stack Overflow, but I believe what I need is different from a simple shared memory declaration. I'm just looking for some guidance on what can be done and which of these options is fastest.

If relevant, max_size will be the same (constant) within every simulation. In other words, it only changes from one simulation to another and never within the same one.

  • Use C++ templates with `max_size` as a parameter? – talonmies Mar 05 '19 at 03:53
  • Thank you for the edit and answer! I'll try that. (p.s. I'm a big fan of your answers! keep up the great work) – Pierre Kawak Mar 05 '19 at 16:21
  • It isn't really an answer, more of a comment/question. If you have a finite number of sizes to work with, templates make sense and don't add too much complexity to the calling host code. If you think it works, I would encourage you to answer this question yourself with your solution – talonmies Mar 05 '19 at 16:24
  • Yeah I'm working on it right now (https://stackoverflow.com/a/6179580). I will time my code after that change and evaluate this. Do you think this move will jeopardize NVCC's ability to optimize my kernel call? – Pierre Kawak Mar 05 '19 at 16:43
  • That answer isn't what I meant. But in any case the whole point of templates is that the compiler can optimize because the size is known at compile time – talonmies Mar 05 '19 at 16:55
  • max_size would not be known at compile time though since it is in an input file that main will read in before any operation. – Pierre Kawak Mar 05 '19 at 17:05
  • Yes I understand this, but I imagine there are a finite number of possible values. You can instantiate all of them at compile time. You can also use the JIT compilation facility to compile whatever value you want. – talonmies Mar 05 '19 at 17:56
  • Right. I will look into these. – Pierre Kawak Mar 05 '19 at 19:20
  • https://stackoverflow.com/a/35830749/10051264 Your answer here seems like what I need now that I've done some research on this. But how do I tell C++ that max_size, which is only known at runtime, can only be whatever values I've instantiated the kernel with? Also, how do I instantiate for specific discrete values and not an evenly spaced list (10-20 in your example)? So for example, let's say max_size can be 20, 35 and 50 – Pierre Kawak Mar 05 '19 at 21:42

1 Answer


This was simpler than I thought. Use `new` and `delete` inside the kernel to achieve this, the same way you would on the host.

The only difference is that you first need a runtime API call that reserves heap memory on the device for those allocations:

```cpp
cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapsize);
```

where, for a launch of N threads that each hold three int arrays of N_SIZE elements, heapsize would be:

```cpp
size_t heapsize = (size_t)( N * 3 * N_SIZE * sizeof(int) );
```
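A minimal sketch of that approach, adapted to the double arrays from the question (untested here; the launch configuration and buffer setup are illustrative placeholders):

```cuda
__global__ void KERNEL(int max_size, double* x, double* y, double* z)
{
    // Per-thread dynamic allocation from the device heap sized by
    // cudaDeviceSetLimit on the host.
    double* localArray_x = new double[max_size];
    double* localArray_y = new double[max_size];
    double* localArray_z = new double[max_size];
    if (localArray_x && localArray_y && localArray_z) {
        // do stuff here
    }
    delete[] localArray_x;
    delete[] localArray_y;
    delete[] localArray_z;
}

int main()
{
    const int N = 256;        // total threads (illustrative)
    const int max_size = 20;  // read from the input file in practice

    // Reserve enough device heap for every thread's three arrays;
    // this must happen before the kernel launches.
    size_t heapsize = (size_t)N * 3 * max_size * sizeof(double);
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapsize);

    // ... allocate x, y, z with cudaMalloc, then launch:
    // KERNEL<<<1, N>>>(max_size, x, y, z);
    return 0;
}
```

Note that device-side `new` draws from this heap rather than from `cudaMalloc`'d global memory, so if the limit is too small the allocations return null, which is why the kernel checks the pointers before using them.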