Suppose we have:
- A single warp (of 32 threads)
- Each thread t holds 32 int values val_{t,0}, ..., val_{t,31}.
- Each value val_{t,i} needs to be added, atomically, to a variable dest_i, which resides in (Option 1) global device memory or (Option 2) shared block memory.
Which access pattern would be faster for performing these additions?
1. All threads atomic-add val_{t,1} to dest_1.
   All threads atomic-add val_{t,2} to dest_2.
   etc.
2. Each thread, with index t, atomic-adds val_{t,t} to dest_t.
   Each thread, with index t, atomic-adds val_{t,(t+1) mod 32} to dest_{(t+1) mod 32}.
   etc.
In other words, is it faster when all threads of a warp make an atomic write to the same destination in the same cycle, or is it better that no two atomic writes coincide? I can imagine hardware that would carry out either option faster; I want to know what is actually implemented.
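To make the two patterns concrete, here is a minimal sketch for Option 1 (global memory). The names `same_destination`, `distinct_destinations`, `make_val` and `run_pattern` are mine and purely illustrative; `make_val` stands in for "values that are already in registers". The sketch only checks that both patterns produce the same sums and says nothing about which one is faster.

```cuda
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;

// Hypothetical stand-in for val_{t,i}: computed arithmetically, so no memory
// traffic is needed to obtain it.
__host__ __device__ inline int make_val(int t, int i) { return t * kWarpSize + i; }

// Pattern 1: in every step, all 32 lanes atomic-add to the SAME destination.
__global__ void same_destination(int* dest) {
    int t = threadIdx.x;
    for (int i = 0; i < kWarpSize; ++i) {
        atomicAdd(&dest[i], make_val(t, i));
    }
}

// Pattern 2: in every step, lane t targets dest[(t + step) mod 32], so the
// 32 lanes hit 32 DIFFERENT destinations.
__global__ void distinct_destinations(int* dest) {
    int t = threadIdx.x;
    for (int step = 0; step < kWarpSize; ++step) {
        int i = (t + step) % kWarpSize;
        atomicAdd(&dest[i], make_val(t, i));
    }
}

// Launches one warp with the chosen pattern and verifies every dest[i]
// against a sum computed on the host. Returns true if all sums match.
bool run_pattern(bool same_dest) {
    int* d_dest = nullptr;
    if (cudaMalloc(&d_dest, kWarpSize * sizeof(int)) != cudaSuccess) return false;
    cudaMemset(d_dest, 0, kWarpSize * sizeof(int));

    if (same_dest) {
        same_destination<<<1, kWarpSize>>>(d_dest);
    } else {
        distinct_destinations<<<1, kWarpSize>>>(d_dest);
    }

    int h_dest[kWarpSize];
    cudaError_t err = cudaMemcpy(h_dest, d_dest, sizeof(h_dest), cudaMemcpyDeviceToHost);
    cudaFree(d_dest);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < kWarpSize; ++i) {
        int expected = 0;
        for (int t = 0; t < kWarpSize; ++t) expected += make_val(t, i);
        if (h_dest[i] != expected) return false;
    }
    return true;
}
```

Calling `run_pattern(true)` exercises pattern 1 and `run_pattern(false)` exercises pattern 2. For Option 2, `dest` would be a `__shared__ int[32]` array inside the kernel instead. To compare speed, I assume one would repeat the loop many times and time it with CUDA events or a profiler, since a single-warp launch is dominated by launch overhead.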
Thoughts:
- A GPU could have hardware that bunches together multiple atomic ops from the same warp to a single destination, so that they count as just one, or can at least be scheduled together. All threads would then proceed to the next instruction at the same time, rather than waiting for the last atomic op to conclude after all the others are done.
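For comparison, the same bunching can be done by hand in software: reduce the 32 values inside the warp and let one lane issue a single atomic. This is only a sketch of that idea, not a claim about what the hardware does. The names `warp_sum`, `aggregated_add` and `run_aggregated` are hypothetical, the values are again made up arithmetically, and it assumes a full warp of 32 active threads and CUDA 9 or later (for `__shfl_down_sync`).

```cuda
#include <cuda_runtime.h>

// Sums v across all 32 lanes of a full warp using shuffles.
// After the loop, lane 0 holds the total; other lanes hold partial sums.
__device__ inline int warp_sum(int v) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        v += __shfl_down_sync(0xffffffffu, v, offset);
    }
    return v;
}

// For each destination i, the warp first combines its 32 contributions,
// then lane 0 alone performs one atomicAdd instead of 32.
__global__ void aggregated_add(int* dest) {
    int t = threadIdx.x;
    for (int i = 0; i < 32; ++i) {
        int val = t * 32 + i;          // hypothetical stand-in for val_{t,i}
        int total = warp_sum(val);     // every lane participates in the shuffle
        if (t == 0) {
            atomicAdd(&dest[i], total);
        }
    }
}

// Launches one warp and checks each dest[i] against a host-side sum.
bool run_aggregated() {
    int* d_dest = nullptr;
    if (cudaMalloc(&d_dest, 32 * sizeof(int)) != cudaSuccess) return false;
    cudaMemset(d_dest, 0, 32 * sizeof(int));

    aggregated_add<<<1, 32>>>(d_dest);

    int h_dest[32];
    cudaError_t err = cudaMemcpy(h_dest, d_dest, sizeof(h_dest), cudaMemcpyDeviceToHost);
    cudaFree(d_dest);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < 32; ++i) {
        int expected = 0;
        for (int t = 0; t < 32; ++t) expected += t * 32 + i;
        if (h_dest[i] != expected) return false;
    }
    return true;
}
```

If the hardware already does something equivalent for same-address atomics, this manual version would presumably gain nothing over pattern 1; that is exactly what I'd like to find out.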
Notes:
- This question focuses on NVIDIA hardware with CUDA, but I'd appreciate answers regarding AMD and other GPUs as well.
- Never mind how the threads obtain their values. Assume the values are in registers with no spillage, or that they are the result of some arithmetic operation done in registers. Forget about any memory accesses needed to get them.