Let's say I want a CUDA kernel that needs to do lots of stuff, but there are some parameters that are constant across all the kernels. These arguments are passed to the main program as input, so they cannot be defined with a #define.
The kernel will run multiple times (around 65K) and it needs those parameters (and some other inputs) to do its maths.
My question is: what's the fastest (or, failing that, the most elegant) way of passing these constants to the kernels?
The constants are float* or int* arrays of 2 or 3 elements each. There will be around 5 to 10 of them.
Toy example with 2 constants, const1 and const2:
__global__ void kernelToyExample(int inputdata, ?????){
    float value = inputdata*const1[0] + const2[1]/const1[2];
}
Is it better to pass the arrays as pointers:
__global__ void kernelToyExample(int inputdata, float* const1, float* const2){
    float value = inputdata*const1[0] + const2[1]/const1[2];
}
or to pass every element as a separate scalar argument:
__global__ void kernelToyExample(int inputdata, float const1x, float const1y, float const1z, float const2x, float const2y){
    float value = inputdata*const1x + const2y/const1z;
}
Or maybe declare them in some global read-only memory and let the kernels read from there? If so, which one: L1, L2, global?
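To make that last option concrete, it might look roughly like the sketch below, which assumes __constant__ memory filled once from the host with cudaMemcpyToSymbol. The names d_const1, d_const2, runToyExample and the out buffer are made up for illustration, both arrays are assumed to have 3 floats, and error checking is left out. Whether this beats the other two options is exactly what is being asked.

```cuda
#include <cuda_runtime.h>

// Hypothetical names: d_const1 and d_const2 stand in for const1 and const2.
// Assumption: each array holds 3 floats.
__constant__ float d_const1[3];
__constant__ float d_const2[3];

__global__ void kernelToyExample(int inputdata, float* out) {
    // Every thread reads the same constant addresses, so the reads can be
    // served as a broadcast from the constant cache.
    float value = inputdata * d_const1[0] + d_const2[1] / d_const1[2];
    out[blockIdx.x * blockDim.x + threadIdx.x] = value;
}

// Hypothetical host helper: uploads the constants once, launches one block
// of 32 threads, and returns the value computed by thread 0.
float runToyExample(int inputdata, const float* const1, const float* const2) {
    // Copy the runtime-provided constants into constant memory.
    cudaMemcpyToSymbol(d_const1, const1, 3 * sizeof(float));
    cudaMemcpyToSymbol(d_const2, const2, 3 * sizeof(float));

    float* d_out = nullptr;
    cudaMalloc(&d_out, 32 * sizeof(float));
    kernelToyExample<<<1, 32>>>(inputdata, d_out);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return result;
}
```

In the real program the two cudaMemcpyToSymbol calls would presumably happen once, before the ~65K launches, rather than on every call as in this helper.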
Is there a better way I don't know of?
Running on a Tesla K40.