Is there a better way to process "undividable count of numbers by block_size" in CUDA?

Question

I need to do a data reduction (find the k-max number) on a vector of N numbers. The problem is that I don't know N beforehand (at compile time), and I'm not sure I'm doing it right by launching two kernels: one with (int)(N / block_size) blocks, and a second with a single block of N % block_size threads.

Is there a better way to process "undividable" count of numbers by block_size in CUDA?

score 3 · Answer 1 · answered Dec 04 '13 at 00:20

A typical approach is like this (1-D grid example):

#define DATA_SIZE ...   // this is some arbitrary number
#define NUM_THREADS_PER_BLOCK ...  // this is block size, usually a multiple of 32
                                  // typical choices are 256, 512, 1024 (pick one)

unsigned int N = DATA_SIZE;  
unsigned int nTPB = NUM_THREADS_PER_BLOCK; 
my_kernel<<<(N + nTPB - 1)/nTPB, nTPB>>>(...);

This assumes your kernel has a "thread check" at the beginning like this:

unsigned int idx = threadIdx.x + blockDim.x*blockIdx.x;
if (idx < DATA_SIZE){
   // kernel code goes here
}
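
Since the question says N is only known at run time, here is a minimal sketch of the same pattern with the element count passed as a kernel argument instead of a macro. The kernel `scale_by_two` and the helpers `blocks_for` and `scale_on_device` are hypothetical names made up for illustration; the doubling operation is just a stand-in for the real kernel body.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: doubles each element. n is supplied at run time.
__global__ void scale_by_two(float *data, unsigned int n)
{
    unsigned int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < n) {              // thread check: the last block may be only partly used
        data[idx] *= 2.0f;
    }
}

// Number of blocks needed to cover n elements (integer ceiling division).
unsigned int blocks_for(unsigned int n, unsigned int nTPB)
{
    return (n + nTPB - 1) / nTPB;
}

// Hypothetical host helper: copy in, launch, copy out. Returns true on success.
bool scale_on_device(std::vector<float> &host, unsigned int nTPB)
{
    unsigned int n = static_cast<unsigned int>(host.size());
    if (n == 0) return true;    // a grid with zero blocks is not a valid launch

    size_t bytes = n * sizeof(float);
    float *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;

    bool ok = cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        scale_by_two<<<blocks_for(n, nTPB), nTPB>>>(dev, n);
        ok = cudaGetLastError() == cudaSuccess;   // catches launch configuration errors
    }
    // This blocking copy also waits for the kernel to finish.
    ok = ok && cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dev);
    return ok;
}
```

For example, with 1000 elements and 256 threads per block, `blocks_for(1000, 256)` gives 4 blocks, i.e. 1024 threads, and the thread check leaves the last 24 idle. When the count is already a multiple of the block size, no extra block is launched.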

Robert Crovella

Thx for the answer. But is it more efficient than creating two kernels without the IF statement? – zajac.m2 Dec 04 '13 at 08:07
The cost of the if statement is trivial (nanoseconds) vs. the cost of spinning up an additional kernel (microseconds). – Robert Crovella Dec 04 '13 at 14:22
That is disputable. The kernel launch overhead depends on whether you are running on WDDM or not, but either way, if you have a few hundred million threads, the IF statement overhead is not so trivial. The answer is good, I am just curious :) – zajac.m2 Dec 04 '13 at 16:26
I prefer `if (idx >= DATA_SIZE) return;` so that I don't end up with the body of the kernel inside a big `if` clause. – Roger Dahl Dec 04 '13 at 18:55

score 1 · Accepted Answer · answered Dec 04 '13 at 19:08

@RobertCrovella's answer describes the standard way of handling the situation and there is typically no need to worry about the extra if conditional that is needed in the kernel.

However, another alternative is to allocate the input and output buffers with padding up to a number that is divisible by the block size, run the kernel (without the if) and then ignore the extra results, for instance by not copying them back to the CPU.
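
A rough sketch of the padding idea, under the assumption that the buffer is rounded up to the next multiple of the block size and only the first n results are copied back. The names `scale_by_two_padded`, `padded_count` and `scale_padded` are hypothetical, and the doubling kernel is only a placeholder for the real work.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel with no thread check: every launched thread must
// map to allocated memory, which the padding guarantees.
__global__ void scale_by_two_padded(float *data)
{
    unsigned int idx = threadIdx.x + blockDim.x * blockIdx.x;
    data[idx] *= 2.0f;
}

// Smallest multiple of nTPB that is >= n.
size_t padded_count(size_t n, size_t nTPB)
{
    return ((n + nTPB - 1) / nTPB) * nTPB;
}

// Hypothetical host helper: pad on the device, launch, copy back n results.
bool scale_padded(std::vector<float> &host, unsigned int nTPB)
{
    size_t n = host.size();
    if (n == 0) return true;    // nothing to launch

    size_t padded = padded_count(n, nTPB);
    float *dev = nullptr;
    if (cudaMalloc(&dev, padded * sizeof(float)) != cudaSuccess) return false;

    // Give the padding a defined value, then overwrite the first n elements.
    bool ok = cudaMemset(dev, 0, padded * sizeof(float)) == cudaSuccess;
    ok = ok && cudaMemcpy(dev, host.data(), n * sizeof(float),
                          cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        unsigned int blocks = static_cast<unsigned int>(padded / nTPB);
        scale_by_two_padded<<<blocks, nTPB>>>(dev);
        ok = cudaGetLastError() == cudaSuccess;
    }
    // Ignore the padded tail: copy back only the n real results.
    ok = ok && cudaMemcpy(host.data(), dev, n * sizeof(float),
                          cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dev);
    return ok;
}
```

For an element-wise kernel like this one, the padding values do not matter because their results are discarded. For a reduction such as the maximum search in the question, the padding would presumably need a neutral value (for instance the lowest representable number) so that it can never be selected.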
