
I've been reading up on atomic operations in CUDA and the update pattern for atomicInc() seems fairly arbitrary.

From the CUDA Programming Guide:

unsigned int atomicInc(unsigned int* address, unsigned int val);

reads the 32-bit word old located at the address address in global or shared memory, computes ((old >= val) ? 0 : (old+1)), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.

This is an increment and "wrap-around" operation. atomicInc(&x, val) is very similar to x = (x+1) % (val+1).
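
Written out as plain (non-atomic) pseudocode, my understanding of what the hardware does in one transaction is roughly the following (illustration only, not thread-safe):

// Illustration only: the real atomicInc performs this read-modify-write
// as a single atomic transaction; this plain version is NOT thread-safe.
__device__ unsigned int atomicInc_illustration(unsigned int *address, unsigned int val)
{
    unsigned int old = *address;                // read
    *address = (old >= val) ? 0u : (old + 1u);  // conditionally wrap, store
    return old;                                 // return the old value
}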

For what kind of application would this be useful? Is this a common code pattern in CUDA programming? If so, is there dedicated hardware for performing this operation which makes atomicInc() better than an equivalent operation using atomicCAS()?

  • Not a complete answer, but that instruction sounds really useful for lock-free ring buffers. Set val to the size of the ring buffer and you can increment the read and write positions with this operation – Homer512 Jul 15 '22 at 06:42

1 Answer


For what kind of application would this be useful?

As indicated in the comments, one possible application is asynchronous access to a ring buffer. The rollover behavior of atomicInc means that when a producer or consumer advances past the "end" of the ring buffer, the index returned by atomicInc rolls over to the beginning, i.e. 0. Although it is not a ring-buffer example, the general idea of using atomics to mediate buffer access is covered here; a ring buffer using atomicInc would be a fairly trivial extension of it. I'm not going to provide a full tutorial on ring buffers, but as an example, you would likely maintain separate input and output indices, updated by two different routines that both use atomicInc().
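
As a rough sketch of the producer side (the names write_idx, RING_SIZE, and the produce kernel are mine, and full/empty bookkeeping is omitted), claiming a slot could look like this:

#define RING_SIZE 1000u  // need not be a power of 2

__device__ unsigned int write_idx = 0;  // producer index into the ring

__global__ void produce(int *ring, const int *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Claim a unique slot; the returned index wraps from
        // RING_SIZE-1 back to 0, so we pass RING_SIZE-1 as val.
        unsigned int slot = atomicInc(&write_idx, RING_SIZE - 1u);
        ring[slot] = src[i];
    }
}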

Is this a common code pattern in CUDA programming?

Based on my experience, it is nowhere near as common as the use of atomicAdd, for example.

If so, is there dedicated hardware for performing this operation which makes atomicInc() better than an equivalent operation using atomicCAS()?

Yes, there is a dedicated (SASS) instruction, and hardware, for it. Any atomic RMW op on a single variable can be "emulated" using compare-and-swap, but only with a retry loop (the canonical atomicCAS testing-loop method). That is generally less efficient and less performant than a dedicated instruction, which needs no loop for a proper implementation.

Without this instruction, one option for a ring buffer is to let a counter run free using atomicAdd of 1, and have every user apply the modulo themselves before using the result as an index. (This is what CPU implementations typically do; CPUs often have an atomic add but not a custom-wrapping atomic increment.) But that only works cleanly for power-of-2 buffer sizes, so that the implicit wrap at the type width doesn't introduce an offset, and so that the modulo can be computed with a single AND operation rather than an actual division or a multiplicative inverse.
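
A sketch of that atomicAdd approach (hypothetical names again; RING_SIZE must be a power of 2 here so the counter's implicit wrap at 2^32 stays aligned with the buffer size):

#define RING_SIZE 1024u  // must be a power of 2 for this approach

__device__ unsigned int counter = 0;  // free-running, wraps at 2^32

__global__ void produce_pow2(int *ring, const int *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned int ticket = atomicAdd(&counter, 1u);  // free-running count
        unsigned int slot = ticket & (RING_SIZE - 1u);  // cheap modulo via AND
        ring[slot] = src[i];
    }
}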

The other option is a CAS retry loop: read the old value, manually compute the wrapped new value, then try to CAS it in, retrying if another thread got there first. This allows indexing non-power-of-2 buffers, but is inefficient.
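
For illustration, emulating atomicInc that way with the canonical atomicCAS retry loop might look like this (a sketch, not something you'd want in place of the real instruction):

__device__ unsigned int atomicInc_cas(unsigned int *address, unsigned int val)
{
    unsigned int old = *address, assumed;
    do {
        assumed = old;
        unsigned int next = (assumed >= val) ? 0u : (assumed + 1u);
        // atomicCAS returns the value it found at *address; if that isn't
        // what we assumed, another thread raced us -- loop and retry.
        old = atomicCAS(address, assumed, next);
    } while (old != assumed);
    return old;  // same return convention as atomicInc
}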

atomicInc is (probably) about as fast as atomicAdd, with both supported in hardware. Using it is slightly more efficient in the power-of-2 case (it avoids an & (size-1) on the atomicAdd result), and it avoids CAS retry loops in the non-power-of-2 case.

– Robert Crovella, Peter Cordes