For what kind of application would this be useful?
As indicated in the comments, one possible application would be asynchronous access to a ring buffer. Because atomicInc rolls over, when a consumer wants access to an element beyond the "end" of the ring buffer, the stored index wraps back to the beginning, i.e. 0, so the values returned by atomicInc always stay within the buffer. Although it is not a ring-buffer example, the idea of using atomics for buffer access is covered here. The ring buffer with atomicInc could be a trivial extension to it. I'm not going to provide a tutorial on ring buffers, but, for example, you would likely have both an input and an output index, maintained separately by two different routines that would both use atomicInc().
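As a rough illustration of the index-claiming idea only (not a complete ring buffer), here is a minimal sketch. The kernel name claim_slots, the host helper count_hits, and the hits array are hypothetical names made up for this example; the only behavior relied on is that atomicInc(address, val) returns the old value and stores 0 once the old value reaches val.

```cuda
#include <cuda_runtime.h>

// Each thread claims the next slot of a ring buffer with `capacity` entries.
__global__ void claim_slots(unsigned int *head, unsigned int *hits,
                            unsigned int capacity)
{
    // atomicInc stores (old >= capacity - 1) ? 0 : old + 1 and returns old,
    // so `slot` is always in [0, capacity - 1], even for non-power-of-2 sizes.
    unsigned int slot = atomicInc(head, capacity - 1u);
    atomicAdd(&hits[slot], 1u);  // record which slot this thread received
}

// Hypothetical host helper: launches `threads` claims (one block, so keep
// threads <= 1024) and returns how many times slot `probe` was handed out.
unsigned int count_hits(unsigned int capacity, unsigned int threads,
                        unsigned int probe)
{
    unsigned int *head, *hits, result = 0;
    cudaMalloc(&head, sizeof(unsigned int));
    cudaMalloc(&hits, capacity * sizeof(unsigned int));
    cudaMemset(head, 0, sizeof(unsigned int));
    cudaMemset(hits, 0, capacity * sizeof(unsigned int));
    claim_slots<<<1, threads>>>(head, hits, capacity);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&result, hits + probe, sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(head);
    cudaFree(hits);
    return result;
}
```

Note that the second argument to atomicInc is the largest index, not the buffer size. A real producer/consumer design would presumably keep a second index of the same kind for the other end, plus some way of knowing when a slot is actually ready.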
Is this a common code pattern in CUDA programming?
Based on my experience, it is nowhere near as common as the use of atomicAdd, for example.
If so, is there dedicated hardware for performing this operation which makes atomicInc() better than an equivalent operation using atomicCAS()?
Yes, there is a dedicated (SASS) instruction and hardware for it. Any atomic RMW op on a single variable can be "emulated" using a compare-and-swap, but only with a retry loop (as in the canonical testing loop method). This is generally less efficient than a dedicated instruction, which doesn't require a loop for proper implementation.
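To make the retry loop concrete, here is a sketch of what emulating atomicInc with atomicCAS might look like. The names atomicIncViaCAS, bump_via_cas and run_cas_inc are hypothetical and exist only for this example; it is meant to show the shape of the loop, not to suggest you should use it in place of the built-in.

```cuda
#include <cuda_runtime.h>

// Emulates atomicInc(address, limit) with a compare-and-swap retry loop.
__device__ unsigned int atomicIncViaCAS(unsigned int *address,
                                        unsigned int limit)
{
    unsigned int old = *address, assumed;
    do {
        assumed = old;
        // Do the wrapping by hand on the value we believe is stored.
        unsigned int next = (assumed >= limit) ? 0u : assumed + 1u;
        // Succeeds only if nobody changed *address in the meantime.
        old = atomicCAS(address, assumed, next);
    } while (old != assumed);  // another thread got in first: retry
    return old;
}

__global__ void bump_via_cas(unsigned int *counter, unsigned int limit)
{
    atomicIncViaCAS(counter, limit);
}

// Hypothetical host helper: starts the counter at 0, lets `threads` threads
// (one block) increment it, and returns the final counter value.
unsigned int run_cas_inc(unsigned int limit, unsigned int threads)
{
    unsigned int *counter, result = 0;
    cudaMalloc(&counter, sizeof(unsigned int));
    cudaMemset(counter, 0, sizeof(unsigned int));
    bump_via_cas<<<1, threads>>>(counter, limit);
    cudaMemcpy(&result, counter, sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(counter);
    return result;
}
```

Under contention, every failed atomicCAS costs another trip through the loop, which is the overhead the dedicated instruction avoids.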
Without this instruction, one option for a ring buffer would be to let a counter run free using atomicAdd of 1, and have every user do the modulo before using it (to index a ring buffer). (This is what CPU implementations typically do; CPUs often have atomic-add but not a custom-wrapping atomic-inc.) But that only works cleanly for power-of-2 buffer sizes, where wrapping at the type width doesn't introduce an offset and the modulo can be done efficiently with just an AND operation, not actual division or a multiplicative inverse.
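A sketch of that free-running-counter variant, assuming the capacity is a power of 2. The names claim_slots_pow2 and count_hits_pow2, and the start parameter, are hypothetical and only there so the wrap at the 32-bit boundary can be tried out.

```cuda
#include <cuda_runtime.h>

// Free-running counter: every thread takes a ticket with atomicAdd and
// reduces it to a slot index itself. `capacity` must be a power of 2.
__global__ void claim_slots_pow2(unsigned int *counter, unsigned int *hits,
                                 unsigned int capacity)
{
    unsigned int ticket = atomicAdd(counter, 1u);  // wraps at 2^32
    unsigned int slot = ticket & (capacity - 1u);  // ticket % capacity
    atomicAdd(&hits[slot], 1u);
}

// Hypothetical host helper: starts the counter at `start`, launches
// `threads` claims (one block) and returns the count for slot `probe`.
unsigned int count_hits_pow2(unsigned int capacity, unsigned int threads,
                             unsigned int start, unsigned int probe)
{
    unsigned int *counter, *hits, result = 0;
    cudaMalloc(&counter, sizeof(unsigned int));
    cudaMalloc(&hits, capacity * sizeof(unsigned int));
    cudaMemcpy(counter, &start, sizeof(unsigned int),
               cudaMemcpyHostToDevice);
    cudaMemset(hits, 0, capacity * sizeof(unsigned int));
    claim_slots_pow2<<<1, threads>>>(counter, hits, capacity);
    cudaMemcpy(&result, hits + probe, sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(counter);
    cudaFree(hits);
    return result;
}
```

Because 2^32 is a multiple of any power-of-2 capacity, the slot sequence stays contiguous when the counter overflows. With a capacity such as 6 the sequence would skip at the overflow point, which is the offset problem mentioned above.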
The other option is a CAS retry loop, where you manually do the wrapping on the old value, then try to CAS in the new value. This allows indexing non-power-of-2 buffers, but is inefficient.
atomicInc is (about?) as fast as atomicAdd, both supported by hardware. Using it is slightly more efficient in the power-of-2 case (avoiding an & (size-1) on an atomicAdd result), and avoids CAS retry loops for the non-power-of-2 case.