I am writing kernels that operate elementwise on 1D data: each thread handles either one element of a single 1D array, or the elements at the same index in several 1D arrays of the same length. The dimension N can be anywhere from 2 to hundreds of millions. What is the optimal way to compute the grid size for such a kernel? In particular, is the approach below the optimal way?
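For concreteness, the kernels in question look roughly like this (the kernel name, the arrays, and the per-element operation are placeholders):

    __global__ void elementwise(const float *in, float *out, size_t N)
    {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)                  // guard: the grid may overshoot N
            out[i] = 2.0f * in[i];  // placeholder per-element operation
    }

The sizing code I am asking about follows.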
CPU part:

    cudaSetDevice(device);
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, device);
    max_threads_per_block = deviceProp.maxThreadsPerBlock;
    // _ConvertSMVer2Cores() comes from helper_cuda.h in the CUDA samples
    number_CUDA_cores = _ConvertSMVer2Cores(deviceProp.major, deviceProp.minor)
                        * deviceProp.multiProcessorCount;
GPU part (launch configuration; this still runs on the host, before each launch):

    if (N <= number_CUDA_cores) {  // small N: one element per block
        blocks = N;
        threads_per_block = 1;
    }
    else {                         // large N: one block per CUDA core
        blocks = number_CUDA_cores;
        threads_per_block = (N + number_CUDA_cores - 1) / number_CUDA_cores;  // ceil(N / cores)
        if (threads_per_block > max_threads_per_block) {
            // too many threads per block: fall back to max-sized blocks
            blocks = (N + max_threads_per_block - 1) / max_threads_per_block; // ceil(N / max)
            threads_per_block = max_threads_per_block;
        }
    }
    kernel<<<blocks, threads_per_block>>>( ..... );
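To make the behavior concrete, here is what this logic produces on a K20 (13 SMX x 192 cores = 2496 CUDA cores, maxThreadsPerBlock = 1024), if I have the arithmetic right:

    N = 1,000       -> blocks = 1,000, threads_per_block = 1
    N = 100,000     -> ceil(100,000 / 2496) = 41 <= 1024,
                       so blocks = 2,496, threads_per_block = 41
    N = 10,000,000  -> ceil(10,000,000 / 2496) = 4,007 > 1024,
                       so blocks = ceil(10,000,000 / 1024) = 9,766,
                       threads_per_block = 1,024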
Again, this has to work well for N from 2 to hundreds of millions, on a K20 or K20X and newer GPUs.
Will the answer be different if:
1) the kernel is mostly straight-line code, or
2) the kernel has many if-statements whose branches can diverge in execution? (That is why I am trying to keep the blocks as small as possible; see the sketch below for the kind of branching I mean.)
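For case 2, the branching looks roughly like this (hypothetical names and operations; the point is that the branch taken depends on per-element data, so neighboring threads may take different paths):

    __global__ void divergent(const float *a, float *out, size_t N)
    {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= N) return;
        // Data-dependent branches: threads in the same warp that disagree
        // force the warp to execute the taken paths one after another.
        if (a[i] > 0.0f)
            out[i] = sqrtf(a[i]);
        else if (a[i] < -1.0f)
            out[i] = a[i] * a[i];
        else
            out[i] = 0.0f;
    }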