I am programming in C++/CUDA and have faced a problem:
__global__ void KERNEL(int max_size, double* x, double* y, double* z)
{
    double localArray_x[max_size]; // error: max_size is not a compile-time constant
    double localArray_y[max_size];
    double localArray_z[max_size];
    // do stuff here
}
Right now my only workaround is to predefine max_size as a macro:
#define max_size 20
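For reference, here is a minimal sketch of how that workaround looks in my kernel. The indexing and the copy-in/copy-out loops are simplified placeholders, not my real selection logic:

```cuda
#include <cuda_runtime.h>

#define max_size 20 // hard-coded capacity -- exactly what I want to avoid

__global__ void KERNEL(int n_used, double* x, double* y, double* z)
{
    // Legal now: max_size is a compile-time constant, so each thread gets
    // its own arrays in registers / local memory.
    double localArray_x[max_size];
    double localArray_y[max_size];
    double localArray_z[max_size];

    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Illustrative copy-in of this thread's segment (the real selection
    // depends on position within the simulation box).
    for (int i = 0; i < n_used && i < max_size; ++i) {
        localArray_x[i] = x[tid * max_size + i];
        localArray_y[i] = y[tid * max_size + i];
        localArray_z[i] = z[tid * max_size + i];
    }

    // ... do stuff on the local arrays ...

    // Copy the results back to the global arrays at the end.
    for (int i = 0; i < n_used && i < max_size; ++i) {
        x[tid * max_size + i] = localArray_x[i];
        y[tid * max_size + i] = localArray_y[i];
        z[tid * max_size + i] = localArray_z[i];
    }
}
```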
These arrays are the main focus of my kernel's work. Basically, I have global coordinates, and only segments of those coordinates, selected by their location within the simulation box, are copied into the three local arrays. Work is then done on those coordinates, and at the end of the simulation the results are written back to the global arrays (x, y, z). Because of this, there are certain constraints on the arrays:
- Each launched thread needs its own max_size*3 array elements to manipulate.
- Those arrays are accessed extensively, so the kernel needs fast (ideally on-chip/local) access to them.
- max_size can't be a compile-time constant, since the number density of my coordinates varies with the input given to the host.
I know there are versions of this post across StackOverflow, but I believe what I need is different from a simple shared-memory declaration. I'm just looking for guidance on what can be done here, and which of those options is fastest.
If relevant, max_size will be the same (constant) within every simulation. In other words, it only changes from one simulation to another and never within the same one.
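One idea I'm weighing because of that per-simulation constancy: pass the size as a template parameter and instantiate the kernel for a few candidate capacities at compile time, dispatching on the runtime value. A rough, untested sketch (the candidate sizes and dispatch logic are arbitrary examples):

```cuda
#include <cuda_runtime.h>

// Compile the kernel for a handful of fixed capacities; since max_size
// never changes within a simulation, one instantiation serves the whole run.
template <int MAX_SIZE>
__global__ void KERNEL(double* x, double* y, double* z)
{
    double localArray_x[MAX_SIZE];
    double localArray_y[MAX_SIZE];
    double localArray_z[MAX_SIZE];
    // ... do stuff here, as before ...
    (void)localArray_x; (void)localArray_y; (void)localArray_z;
}

// Host-side dispatch: round the runtime max_size up to the nearest
// pre-instantiated capacity.
void launch(int max_size, int blocks, int threads,
            double* x, double* y, double* z)
{
    if      (max_size <= 16) KERNEL<16><<<blocks, threads>>>(x, y, z);
    else if (max_size <= 32) KERNEL<32><<<blocks, threads>>>(x, y, z);
    else if (max_size <= 64) KERNEL<64><<<blocks, threads>>>(x, y, z);
    // else: fall back to some other scheme
}
```

I don't know whether this dispatch approach is actually the fastest option compared to shared or global memory alternatives, which is part of what I'm asking.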