
Suppose we have:

  • A single warp (of 32 threads)
  • Each thread t has 32 int values val_{t,0} ... val_{t,31},
  • Each value val_{t,i} needs to be added, atomically, to a variable dest_i, which resides either in (Option 1) global device memory or (Option 2) shared block memory.

Which access pattern would be faster for performing these additions:

  1. All threads atomic-add val_{t,0} to dest_0.

    All threads atomic-add val_{t,1} to dest_1.

    etc.

  2. Each thread, with index t, atomic-adds val_{t,t} to dest_t.

    Each thread, with index t, atomic-adds val_{t,(t+1) mod 32} to dest_{(t+1) mod 32}.

    etc.

In other words, is it faster when all threads of a warp make an atomic write to the same location in the same cycle, or is it better when no atomic writes coincide? I can imagine hardware which carries out either of the options faster; I want to know what's actually implemented.
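
For concreteness, here is a minimal sketch of the two patterns (all names are illustrative: `vals` stands for a thread's 32 values, `dest` for the 32 destinations in global or shared memory). Note that the data-dependent indexing in the skewed variant may keep `vals` from actually staying in registers; the sketch is only meant to pin down the access patterns:

```
// Option 1: in each step, all 32 lanes atomic-add to the SAME dest[i].
__device__ void same_address_pattern(const int vals[32], int *dest)
{
    #pragma unroll
    for (int i = 0; i < 32; ++i)
        atomicAdd(&dest[i], vals[i]);  // 32-way intra-warp collision per step
}

// Option 2: in each step, the 32 lanes atomic-add to 32 DISTINCT dest
// entries, rotated so that no two lanes ever collide.
__device__ void skewed_pattern(const int vals[32], int *dest)
{
    const int t = threadIdx.x & 31;    // lane index within the warp
    for (int k = 0; k < 32; ++k) {
        const int i = (t + k) & 31;    // (t + k) mod 32
        atomicAdd(&dest[i], vals[i]);  // no intra-warp collisions per step
    }
}
```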

Thoughts:

  • It's possible for a GPU to have hardware which bunches together multiple atomic ops from the same warp to a single destination, so that they count as just one, or at least can be scheduled together; all threads would then proceed to the next instruction at the same time, rather than waiting for the last atomic op to conclude after all the rest are done. (A software analogue of this idea is sketched below.)
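
Such bunching can also be emulated in software: reduce the 32 values across the warp with shuffle intrinsics, then let a single lane issue one atomic. A minimal sketch, assuming a full, converged warp (`__shfl_down_sync` is the CUDA 9+ spelling; older toolkits use `__shfl_down`):

```
// Software analogue of the hypothesized aggregation: five shuffle steps
// collapse the 32 per-lane values into lane 0, which then issues ONE
// atomicAdd instead of 32 colliding ones.
__device__ void warp_aggregated_add(int v, int *dest)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xFFFFFFFF, v, offset);
    if ((threadIdx.x & 31) == 0)  // lane 0 now holds the full warp sum
        atomicAdd(dest, v);
}
```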

Notes:

  • This question focuses on NVIDIA hardware with CUDA, but I'd appreciate answers regarding AMD and other GPUs.
  • Never mind how the threads obtain the values. Assume they're in registers with no spilling, or that they're the result of some in-register arithmetic; forget about any memory accesses needed to get them.
einpoklum
  • Though I don't know the exact answer, my guess is that skewed access is preferable. At least you will be issuing 32 atomic operations per cycle (in the ideal case), but on real hardware your algorithm may suffer from a poor memory access pattern. – Roman Arzumanyan May 30 '14 at 15:38
  • This sounds like something that could only be answered with a lot of careful benchmarking. What's stopping you trying it? – talonmies May 30 '14 at 16:00
  • @talonmies: 1. I only have one (or maybe two) cards with which to benchmark. 2. The answer might be "it probably depends on other parameters", in which case I'll take them into account. 3. For the actual problem I'm working on, I'm trying to reduce the use of atomics rather than optimize them. – einpoklum May 30 '14 at 16:04
  • I would go with the 2nd. The 1st access pattern has the same address in each cycle for all the lanes, which can especially kill performance on pre-Maxwell CUDA cards when shared memory is the destination (see [here](http://stackoverflow.com/a/19548723/2386951)). Also please have a look [here](http://stackoverflow.com/q/22367238/2386951) and [here](http://stackoverflow.com/q/23744985/2386951). – Farzad May 30 '14 at 16:09

1 Answer


This is how I understand your problem:

You have a 32x32 matrix of int values:

    val_{0,0},  val_{1,0},  ..., val_{31,0}
    val_{0,1},  val_{1,1},  ..., val_{31,1}
    ...
    val_{0,31}, val_{1,31}, ..., val_{31,31}

And you need to sum each row:

    val_{0,0} + val_{1,0} + ... + val_{31,0} = dest_0
    val_{0,1} + val_{1,1} + ... + val_{31,1} = dest_1
    etc.

The problem is that each row's values are distributed across different threads. For a single warp, the cleanest approach would be for each thread to share its values through shared memory (in a 32x32 shared-memory array). After a thread sync, thread i sums the i-th row and writes the result to dest_i, which can reside in global or shared memory (depending on your application). This way the computational work (32x31 additions) is divided evenly between the threads in the warp, AND you don't need atomic ops (performance killers) at all. In my experience, atomic operations usually can and should be avoided by distributing the work differently between the threads.
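
A minimal sketch of that idea, assuming a single-warp block; `vals` and `dest` are illustrative names for the thread's 32 values and the 32 destinations, and the extra padding column keeps the row reads free of shared-memory bank conflicts:

```
// Transpose through shared memory: each thread publishes its 32 values
// as a column, then each thread sums one row and writes its dest once.
__device__ void row_sums_without_atomics(const int vals[32], int *dest)
{
    __shared__ int mat[32][33];   // 33 columns pad away bank conflicts

    const int t = threadIdx.x;    // lane index; assumes a 32-thread block

    for (int i = 0; i < 32; ++i)
        mat[i][t] = vals[i];      // thread t fills column t
    __syncthreads();

    int sum = 0;
    for (int j = 0; j < 32; ++j)
        sum += mat[t][j];         // thread t sums row t
    dest[t] = sum;                // a single plain store, no atomics
}
```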

Gil Shapira