
Many sources suggest implementing a critical section with an atomicCAS-based locking mechanism, for example the accepted answer here or "CUDA by Example: An Introduction to General-Purpose GPU Programming" (A.2.4, pages 272-273, add_to_table).
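
The pattern I mean looks roughly like this (a minimal sketch rather than an exact copy of either source; the kernel and variable names are mine):

```
__device__ int lock = 0;   // 0 = free, 1 = taken
__device__ int count = 0;  // state protected by the lock

__global__ void add_one()
{
    if (threadIdx.x == 0)                        // one thread per block negotiates for the lock
    {
        while (atomicCAS(&lock, 0, 1) != 0) { }  // spin until the lock is acquired
        count += 1;                              // critical section
        __threadfence();                         // make the update visible before releasing
        atomicExch(&lock, 0);                    // release the lock
    }
    __syncthreads();                             // the rest of the block waits here
}
```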

However, I'm not sure this approach is safe. What if a block gets preempted while one of its threads holds a lock, and all the resident blocks are busy-waiting on this lock? Some sources suggest launching at most as many blocks as can become resident simultaneously. Such a solution seems inapplicable if, at the same time, an unknown number of other tasks can be scheduled on the device. Besides, even if the block containing the lock-holding thread is resident, might that thread never be scheduled while the SM is occupied by other busy-waiting threads?

maxplus
    The reasons that I can think of for preemption are: 1. debugging 2. GPU context switch 3. CUDA dynamic parallelism. None of those strike me as concerning for the atomic method. The atomic method in general doesn't require any occupancy considerations, and in the Volta+ case, with appropriate kernel design, I'm not aware of any concerns about starvation. For the example you linked, that is carefully designed so that only 1 thread in the block negotiates for the lock, while all others are at `__syncthreads()`. If you think that suggests a starvation concern, then many many CUDA codes are broken. – Robert Crovella Jun 12 '23 at 02:06
  • @RobertCrovella, I think it's likely that the common application of an atomicCAS critical section is correct, but I would like to understand what guarantees its correctness so that I can adapt it to my use case. To simplify, I want to do (with *x == 0 and *y == 0 initially) `if (threadIdx.x == 0 && atomicCAS(x, 0, 1) == 0) atomicCAS(y, 0, 1); __syncthreads(); while (!atomicCAS(y, 1, 1));` and hope that the loop will not be infinite (see the kernel sketch written out after these comments). – maxplus Jun 12 '23 at 03:31
  • In your implementation indeed only one thread from a block accesses the lock. But, say, there are 20 SMs, at most one block can be resident per SM, and 21 thread blocks are created in a kernel launch. 20 of them become resident, one acquires the lock, gets preempted in favor of the 21-st block, which joins the other 19 busy-waiting indefinitely. Such a situation seems unlikely, but is it impossible? What if these blocks were launched in a stream, and then in another, higher-priority stream, same kernel was launched, happening to preempt the lock-holder? – maxplus Jun 12 '23 at 03:34
  • In "CUDA by Example" (A.2.4, pages 272-273, `add_to_table`), unlike in your answer, a lock is acquired by a single thread in a warp at a time, but multiple warps acquire multiple, not necessarily distinct locks. If there are more warps resident on an SM than can be scheduled, even without block preemption it seems remotely possible that a warp holding a lock will never be scheduled to release a lock, and instead busy-waiting warps will be scheduled indefinitely. I'll readily believe that such an issue is impossible to reproduce. But is it in some way guaranteed to not happen? – maxplus Jun 12 '23 at 03:35
  • "gets preempted in favor of the 21-st block, " what? Never heard of it. And regarding the question about higher priority stream; you have control over what you launch and how you design kernels to work emanating from a particular process. The original answer you linked already says: " improper use of critical sections can lead to deadlock." Good luck! – Robert Crovella Jun 12 '23 at 03:42
  • Thanks for clarifications. I don't have control over what other kernels will emanate from a process using my library, but at least the end developer does, and he won't have access to *my* lock so the exact example with higher-priority stream is impossible. I'll try to adopt a "design kernels and synchronization that don't break in practice" approach without being able to prove their safety. Although I still wonder if that's the best that can be done, and if that's what thrust/cub/cudnn/... do, given that they can be run concurrently with any well-formed developer code and are supposed to work. – maxplus Jun 12 '23 at 13:18
  • I'd like to know the reason for -1. To trigger automatic deletion, even though this question is meaningful and isn't answered anywhere else? – maxplus Jul 12 '23 at 20:57
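
Written out as a kernel, the simplified example from the comments above looks like this (a sketch; `x` and `y` are assumed to point to device integers initialized to zero):

```
__global__ void wait_for_flag(int *x, int *y)
{
    // one thread of whichever block wins the CAS on *x sets *y to 1
    if (threadIdx.x == 0 && atomicCAS(x, 0, 1) == 0)
        atomicCAS(y, 0, 1);
    __syncthreads();

    // every thread then spins until *y becomes 1; the question is whether
    // the winning thread is guaranteed to be scheduled so that this terminates
    while (!atomicCAS(y, 1, 1)) { }
}
```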

1 Answer


I'll welcome an answer from someone more experienced with CUDA, but for now I'll post my own best attempt, based on Robert's comments under the question, on my own experiments, and on official information provided by NVIDIA.

Currently, the only case where CUDA seems to guarantee progress is among threads of a single warp under Independent Thread Scheduling: as long as any thread of the warp makes progress, the lock holder in that warp will eventually be scheduled as well, exit the critical section, and release the lock.
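
To illustrate what that intra-warp guarantee buys, here is a sketch (assuming Volta+ and, for the guarantee to apply, contention only within a single warp; the names are made up) of per-thread lock acquisition that deadlocks on pre-Volta hardware but is expected to terminate under ITS, because the spinning threads cannot indefinitely starve the sibling thread that holds the lock:

```
__device__ int warp_lock = 0;

__global__ void per_thread_increment(int *data)  // e.g. launched as <<<1, 32>>>
{
    bool done = false;
    while (!done)
    {
        if (atomicCAS(&warp_lock, 0, 1) == 0)  // try to take the lock
        {
            *data += 1;                        // critical section
            __threadfence();
            atomicExch(&warp_lock, 0);         // release the lock
            done = true;
        }
    }
}
```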

There are two mechanisms that can alleviate concerns related to lock-holder preemption: cooperative launch and thread block clusters. However, they provide only a co-scheduling guarantee and don't address whether all resident threads make progress. Using the built-in blocking group synchronization, a weak forward-progress guarantee can be achieved.
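
A sketch of the cooperative-launch route (illustrative names and sizes; it requires a device that supports cooperative launches and compilation with relocatable device code): the launch fails unless all blocks can be co-resident, which rules out the "lock holder is not yet resident" scenario, though it says nothing about scheduling fairness among the resident warps.

```
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__device__ int grid_lock = 0;

__global__ void coop_kernel(int *data)
{
    if (threadIdx.x == 0)
    {
        while (atomicCAS(&grid_lock, 0, 1) != 0) { }  // all blocks are co-resident,
        *data += 1;                                   // so the lock holder is running
        __threadfence();
        atomicExch(&grid_lock, 0);
    }
    cg::this_grid().sync();  // built-in blocking grid-wide synchronization
}

// Host side, error checking omitted; the grid must fit on the device at once
// (cudaOccupancyMaxActiveBlocksPerMultiprocessor reports how many blocks per SM fit).
void launch(int *d_data, int num_sms, int blocks_per_sm)
{
    void *args[] = { &d_data };
    cudaLaunchCooperativeKernel((void *)coop_kernel,
                                dim3(num_sms * blocks_per_sm), dim3(128),
                                args, 0, nullptr);
}
```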

That said, in practice it is often (and may even be always) the case that all resident threads make progress (with ITS), and in my experience preemption doesn't happen arbitrarily. So a not-strictly-safe but more performant implementation may be preferred where necessary, as long as it is designed carefully so that it doesn't misbehave in any of the intended environments; this is a relatively common approach in CUDA (especially pre-Volta).

maxplus