I need to do a data reduction (find the k-max number) on a vector of N numbers. The problem is that I don't know N beforehand (at compile time), and I am not sure I'm doing it right by launching two kernels: one with (int)(N / block_size) blocks, and a second one with a single block of N % block_size threads.
Is there a better way in CUDA to process a count of numbers that is not evenly divisible by block_size?
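
For comparison, one alternative to the two-kernel split would be a single launch whose block count is rounded up, with each thread checking its index against N. The sketch below is only an illustration under assumptions: it computes a plain maximum rather than the k-max selection, and `BLOCK_SIZE`, `block_max`, `blocks_for` and `max_of` are hypothetical names. It assumes the kernel is launched with exactly `BLOCK_SIZE` threads per block, and error checking of the CUDA calls is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cfloat>
#include <vector>

// Hypothetical block size; must be a power of two for the tree reduction below.
constexpr int BLOCK_SIZE = 256;

// Each block writes the maximum of its slice of `in` to out[blockIdx.x].
// Assumes blockDim.x == BLOCK_SIZE.
__global__ void block_max(const float* in, float* out, int n)
{
    __shared__ float s[BLOCK_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads past the end of the input load a neutral element
    // instead of reading out of bounds.
    s[threadIdx.x] = (i < n) ? in[i] : -FLT_MAX;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            s[threadIdx.x] = fmaxf(s[threadIdx.x], s[threadIdx.x + stride]);
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = s[0];
}

// Round up so that a partially filled last block is still launched.
int blocks_for(int n, int block_size)
{
    return (n + block_size - 1) / block_size;
}

// Host wrapper: n is only known at run time.
float max_of(const std::vector<float>& h)
{
    int n = static_cast<int>(h.size());
    if (n == 0)
        return -FLT_MAX;  // nothing to reduce; avoid a zero-block launch

    int blocks = blocks_for(n, BLOCK_SIZE);
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_max<<<blocks, BLOCK_SIZE>>>(d_in, d_out, n);

    // Combine the per-block results on the host.
    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_out, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    float m = -FLT_MAX;
    for (float p : partial)
        m = std::max(m, p);
    return m;
}
```

With N = 1000 and a block size of 256 this launches 4 blocks, and the last 24 threads of the final block only contribute the neutral element, so no second kernel is needed for the remainder.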