Many sources suggest implementing a critical section with an atomicCAS-based locking mechanism, for example the accepted answer here or "CUDA by Example: An Introduction to General-Purpose GPU Programming" (A.2.4, pages 272-273, add_to_table).
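For concreteness, the pattern I am asking about looks roughly like the sketch below. This is not the book's code: the names `lock`, `unlock`, `increment`, and `run_increment` are hypothetical, and I let only thread 0 of each block take the lock to sidestep intra-warp contention.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical spin lock built on atomicCAS: 0 = free, 1 = held.
__device__ void lock(int *mutex) {
    // Busy-wait until this thread swaps the mutex from 0 to 1.
    while (atomicCAS(mutex, 0, 1) != 0) { }
    __threadfence();  // order the critical section after the acquire
}

__device__ void unlock(int *mutex) {
    __threadfence();  // publish critical-section writes before release
    atomicExch(mutex, 0);
}

__global__ void increment(int *mutex, int *counter) {
    // Only one thread per block competes for the lock.
    if (threadIdx.x == 0) {
        lock(mutex);
        volatile int *c = counter;  // volatile to avoid a stale cached read
        *c = *c + 1;                // the critical section
        unlock(mutex);
    }
}

// Hypothetical host helper: launches `blocks` blocks and returns the counter.
int run_increment(int blocks) {
    int *mutex = nullptr, *counter = nullptr;
    cudaMalloc(&mutex, sizeof(int));
    cudaMalloc(&counter, sizeof(int));
    cudaMemset(mutex, 0, sizeof(int));
    cudaMemset(counter, 0, sizeof(int));

    increment<<<blocks, 32>>>(mutex, counter);

    int result = 0;
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&result, counter, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(mutex);
    cudaFree(counter);
    return result;
}

int main() {
    printf("counter = %d\n", run_increment(64));
    return 0;
}
```

When every lock holder stays resident and keeps getting scheduled, I would expect the counter to equal the number of blocks. My question is about the cases where that assumption does not hold.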
However, I'm not sure this approach is safe. What if a block gets pre-empted while one of its threads holds the lock, and all the resident blocks are busy-waiting on that lock? Some sources suggest launching at most as many blocks as can be resident simultaneously. That solution seems inapplicable if an unknown number of other tasks can be scheduled on the device at the same time. Besides, even if the block containing the lock-holding thread is resident, couldn't that thread never be scheduled while the SM is occupied by other busy-waiting threads?