This is the very definition of annoying boilerplate. Clearly the kernel should never be invoked where index >= n.
OpenCL doesn't require that you do such a bounds check, it is already done for you outside of the kernel.
OpenCL / CUDA kernels run in parallel with thousands or millions of threads. These threads are grouped into workgroups of 32/64/128/256 threads, ideally 64. Within each workgroup, threads can for example communicate data via local/shared memory, and restrictions apply to branching: if one thread takes slow branch A and all others take fast branch B, all threads in the workgroup have to wait until the one thread finishes A.
Workgroups are always executed as entire workgroups. That means: if you have a global size of 50 threads, each of which acts on one entry of an array of size 50, and a workgroup size of 32, you will get 2 workgroups. The first one (threads 0-31) is full, and every thread can read/write its entry in the array. But only part of the second workgroup (threads 32-63) accesses valid data in the array, at positions 32-49; threads 50-63 would read/write unallocated memory, causing a crash.
To prevent crashes from out-of-bounds memory access by the last, partially filled workgroup, a guard clause is usually introduced:
if(thread_index>=global_range) return;
so that the dummy threads in the last, partially filled workgroup return right away and don't access memory.
Only if you make sure, by other sanity checks on the CPU side, that your array size is always a clean multiple of the workgroup size can you omit this guard clause.
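As a minimal sketch (the kernel name, arguments, and the scaling operation are illustrative, not taken from the question), an OpenCL C kernel with this guard clause looks like:

__kernel void scale(__global float* data, const float factor, const uint n) {
    const uint i = get_global_id(0); // globally unique work-item index
    if(i>=n) return;                 // guard clause: padded work-items in the last workgroup exit immediately
    data[i] *= factor;               // safe: i is guaranteed to be within [0, n)
}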
Shorter:
CUDA has no independent definition of the global thread space that could be used to limit the total number of threads to match the problem size. Therefore this limiting, if needed, must be done in kernel code. OpenCL provides an independent definition of the global work-item space/size, and the launch mechanism uses it to make sure that no more than that many work-items are launched. If this suffices for the problem size, no further in-kernel conditioning is necessary.
Longer:
In CUDA, the only definition of the "global thread space" (CUDA would typically call this "the grid") that we have is given by the grid dimension (number of blocks) and the block dimension (number of threads per block) in the kernel-launch syntax, for example using typical CUDA runtime API syntax:
kernel<<<number_of_blocks, threads_per_block,...>>>(...);
The second number roughly corresponds to the local work size in OpenCL, and the product of the two numbers roughly corresponds to the global work size in OpenCL. In CUDA, there is no other way to specify the "global thread space" (corresponding to the global work size in OpenCL, i.e. the total number of work-items launched in OpenCL).
In CUDA, then, the "global thread space" is given by the product of these 2 numbers indicated at kernel launch, and therefore we often end up in a situation where it is convenient to specify a grid size that is larger than the needed number of threads (which is usually determined by the problem size rather than the grid size). The reasons for this are well covered in various forum posts, but arise fundamentally out of the granular nature of grid specification that this implies. For example, see here for a discussion of some calculation considerations.
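For example, the canonical host-side calculation (a sketch; N, the block size, and the kernel name are placeholders) rounds the grid size up with a ceiling division, which is exactly what produces the surplus threads:

int N = 1000;                     // problem size (illustrative)
int threads_per_block = 256;      // roughly the OpenCL local work size
int number_of_blocks = (N + threads_per_block - 1) / threads_per_block; // ceiling division -> 4
// total threads launched = 4 * 256 = 1024, i.e. 24 more than the problem needs
kernel<<<number_of_blocks, threads_per_block>>>(..., N);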
When the grid size is larger than the needed number of threads, it is extremely common (and often necessary, to prevent for example out-of-bounds indexing) to use what I refer to as a "thread check" in the kernel (1D example):
__global__ void kernel(..., size_t N){
    size_t idx = blockIdx.x*blockDim.x+threadIdx.x; // get globally unique thread ID
    if (idx < N) {                                  // make sure thread will be "in-bounds" for problem space
        ...                                         // body of kernel
    }
}
Even if we wanted to eliminate this "boilerplate" using an automatic mechanism of some sort at the kernel launch point, we have no definition with which to do so. There is no number provided by the programmer or the launch API(s) that the CUDA runtime could "automatically" use to limit the number of threads that are launched to fewer than the grid definition given by the launch configuration arguments.
In OpenCL, however, we have a separate, independent definition of the "global thread space", namely the global_work_size parameter of clEnqueueNDRangeKernel. This argument is provided independently of any other launch parameter, and therefore we have an "independent" definition of the "global thread space" (the global work-item space) which isn't subject to any "granular specification" necessities. The launch mechanism therefore can and does limit the global work-item space/size to be equal to this number. In situations where that suffices for the actual problem size, no further "boilerplate" conditioning of active work-items is necessary.
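A host-side sketch of what that looks like (error handling omitted; queue, krnl, and N are assumed to have been set up earlier):

size_t global_work_size = N;   // exactly the problem size: the independent global work-item space
cl_int err = clEnqueueNDRangeKernel(queue,             // cl_command_queue created earlier
                                    krnl,              // cl_kernel created earlier
                                    1,                 // 1D work-item space
                                    NULL,              // no global offset
                                    &global_work_size, // global work size = N
                                    NULL,              // local work size left to the runtime
                                    0, NULL, NULL);    // no wait list, no event returned

Note that if a local_work_size is specified explicitly, OpenCL prior to version 2.0 requires the global work size to be a multiple of it, which is where the padding-and-guard-clause pattern from the other answer reappears.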
As an aside, not really relevant to the question, OpenCL extends the "knowledge" of this global work-item "space" into the kernel API as well, via for example get_global_size() and, relatedly, get_global_id(). CUDA has no corresponding definition, and therefore no corresponding kernel API. Instead, CUDA programmers will typically determine global grid dimensions using a product of the supplied dimension built-in variables (and will typically assemble a globally unique ID using the canonical arithmetic - for the 1D case - that I have already indicated in the kernel example above).
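For instance (a side-by-side sketch, not tied to any particular kernel above), the indexing that OpenCL hands directly to kernel code has to be assembled manually in CUDA:

// OpenCL C: provided directly by the kernel API
size_t gid = get_global_id(0);   // globally unique work-item ID in dimension 0
size_t gsz = get_global_size(0); // total work-items in dimension 0

// CUDA C++: assembled from built-in variables (1D case)
size_t idx = blockIdx.x*blockDim.x+threadIdx.x;   // globally unique thread ID
size_t total = (size_t)gridDim.x*blockDim.x;      // total threads in the grid, dimension x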
My uses of "CUDA" above should primarily have CUDA C++ in view. There are some minor differences when we are talking about another CUDA language binding, such as Numba CUDA Python; however, the general idea of a lack of an independent global space definition applies there as well.