This is the very definition of annoying boilerplate. Clearly the kernel should never be invoked where index >= n.
OpenCL doesn't require that you do such a bounds check, it is already done for you outside of the kernel.
OpenCL / CUDA kernels run in parallel with thousands or millions of threads. These threads are grouped into workgroups of 32/64/128/256 threads, ideally 64. Within each workgroup, threads can for example communicate data via local/shared memory, and restrictions apply to branching: if one thread takes slow branch A and all others take fast branch B, all threads in the workgroup have to wait until the one thread finishes A.
Workgroups are always executed as entire workgroups. That means: if you have a global size of 50 threads, each of which acts on one entry of an array of size 50, and a workgroup size of 32, you will get 2 workgroups. The first one (threads 0-31) is full, and every thread can read/write its entry in the array. But only part of the second workgroup (threads 32-63) accesses valid data in the array, at positions 32-49; threads 50-63 would read/write unallocated memory, causing a crash.
To prevent crashes from out-of-bounds memory access by the last, partially filled workgroup, a guard clause is usually introduced:
if(thread_index>=global_range) return;
so that the dummy threads in the last, partially filled workgroup return right away and don't access memory.
Only if you make sure, by other sanity checks on the CPU side, that your array size is always a clean multiple of the workgroup size can you omit this guard clause.
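As a minimal sketch (the kernel name, arguments, and the scaling operation are illustrative, not taken from the question), an OpenCL C kernel with this guard clause looks like:

__kernel void scale(__global float* data, const float factor, const uint n) {
    const uint i = get_global_id(0); // globally unique work-item index
    if(i>=n) return;                 // guard clause: padded work-items in the last workgroup exit immediately
    data[i] *= factor;               // safe: i is guaranteed to be within [0, n)
}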
Shorter:
CUDA has no independent definition of the global thread space that could be used to limit the total number of threads to match the problem size. Therefore this limiting, if needed, must be done in kernel code. OpenCL provides an independent definition of the global work-item space/size, and the launch mechanism uses it to make sure that no more than that many work-items are launched. If this suffices for the problem size, no further in-kernel conditioning is necessary.
Longer:
In CUDA, the only definition of the "global thread space" (CUDA would typically call this "the grid") that we have is given by the grid dimension (number of blocks) and the block dimension (number of threads per block) in the kernel-launch syntax, for example using typical CUDA runtime API syntax:
kernel<<<number_of_blocks, threads_per_block,...>>>(...);
The second number roughly corresponds to the local work size in OpenCL, and the product of the two numbers roughly corresponds to the global work size in OpenCL. In CUDA, there is no other way to specify the "global thread space" (corresponding to the global work size in OpenCL, i.e. the total number of work-items launched in OpenCL).
In CUDA, then, the "global thread space" is given by the product of these 2 numbers indicated at kernel launch, and therefore we often end up in a situation where it is convenient to specify a grid size that is larger than the needed number of threads (which is usually determined by the problem size rather than the grid size). The reasons for this are well covered in various forum posts, but arise fundamentally out of the granular nature of grid specification that this implies. For example, see here for a discussion of some calculation considerations.
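For example, the canonical host-side calculation (a sketch; N, the block size, and the kernel name are placeholders) rounds the grid size up with a ceiling division, which is exactly what produces the surplus threads:

int N = 1000;                     // problem size (illustrative)
int threads_per_block = 256;      // roughly the OpenCL local work size
int number_of_blocks = (N + threads_per_block - 1) / threads_per_block; // ceiling division -> 4
// total threads launched = 4 * 256 = 1024, i.e. 24 more than the problem needs
kernel<<<number_of_blocks, threads_per_block>>>(..., N);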
When the grid size is larger than the needed number of threads, it is extremely common (and often necessary, to prevent for example out-of-bounds indexing) to use what I refer to as a "thread check" in the kernel (1D example):
__global__ void kernel(..., size_t N){
    size_t idx = blockIdx.x*blockDim.x+threadIdx.x; // get globally unique thread ID
    if (idx < N) {                                  // make sure thread will be "in-bounds" for problem space
        ...                                         // body of kernel
    }
}
Even if we wanted to eliminate this "boilerplate" using an automatic mechanism of some sort at the kernel launch point, we have no definition with which to do so. There is no number provided by the programmer or the launch API(s) that the CUDA runtime could "automatically" use to limit the number of threads that are launched to fewer than the grid definition given by the launch configuration arguments.
In OpenCL, however, we have a separate, independent definition of the "global thread space", namely the global_work_size parameter of clEnqueueNDRangeKernel. This argument is provided independently of any other launch parameter, and therefore we have an "independent" definition of the "global thread space" (the global work-item space) which isn't subject to any "granular specification" necessities. The launch mechanism therefore can and does limit the global work-item space/size to be equal to this number. In situations where that suffices for the actual problem size, no further "boilerplate" conditioning of active work-items is necessary.
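A host-side sketch of what that looks like (error handling omitted; queue, krnl, and N are assumed to have been set up earlier):

size_t global_work_size = N;   // exactly the problem size: the independent global work-item space
cl_int err = clEnqueueNDRangeKernel(queue,             // cl_command_queue created earlier
                                    krnl,              // cl_kernel created earlier
                                    1,                 // 1D work-item space
                                    NULL,              // no global offset
                                    &global_work_size, // global work size = N
                                    NULL,              // local work size left to the runtime
                                    0, NULL, NULL);    // no wait list, no event returned

Note that if a local_work_size is specified explicitly, OpenCL prior to version 2.0 requires the global work size to be a multiple of it, which is where the padding-and-guard-clause pattern from the other answer reappears.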
As an aside, not really relevant to the question, OpenCL extends the "knowledge" of this global work-item "space" into the kernel API as well, via for example get_global_size() and, relatedly, get_global_id(). CUDA has no corresponding definition, and therefore no corresponding kernel API. Instead, CUDA programmers will typically determine global grid dimensions using a product of the supplied dimension built-in variables (and will typically assemble a globally unique ID using the canonical arithmetic - for the 1D case - that I have already indicated in the kernel example above).
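For instance (a side-by-side sketch, not tied to any particular kernel above), the indexing that OpenCL hands directly to kernel code has to be assembled manually in CUDA:

// OpenCL C: provided directly by the kernel API
size_t gid = get_global_id(0);   // globally unique work-item ID in dimension 0
size_t gsz = get_global_size(0); // total work-items in dimension 0

// CUDA C++: assembled from built-in variables (1D case)
size_t idx = blockIdx.x*blockDim.x+threadIdx.x;   // globally unique thread ID
size_t total = (size_t)gridDim.x*blockDim.x;      // total threads in the grid, dimension x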
My uses of "CUDA" above should primarily have CUDA C++ in view. There are some minor differences when we are talking about another CUDA language binding, such as Numba CUDA Python; however, the general idea of a lack of an independent global space definition applies there as well.