
Suppose we have:

  • A single warp (of 32 threads)
  • Each thread t has 32 int values val_{t,0} ... val_{t,31},
  • Each value val_{t,i} needs to be added, atomically, to a variable dest_i, which resides either in (Option 1) global device memory or (Option 2) shared block memory.

Which access pattern would be faster for performing these additions:

  1. All threads atomic-add val_{t,0} to dest_0.

    All threads atomic-add val_{t,1} to dest_1.

    etc.

  2. Each thread, with index t, atomic-adds val_{t,t} to dest_t.

    Each thread, with index t, atomic-adds val_{t,(t+1) mod 32} to dest_{(t+1) mod 32}.

    etc.

In other words, is it faster when all threads of a warp make an atomic write to the same location in the same cycle, or is it better when no atomic writes coincide? I can imagine hardware which carries out either of the options faster; I want to know what's actually implemented.
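
For concreteness, here is a minimal sketch of the two patterns (all names are illustrative: `vals` stands for a thread's 32 values, `dest` for the 32 destinations in global or shared memory). Note that the data-dependent indexing in the skewed variant may keep `vals` from actually staying in registers; the sketch is only meant to pin down the access patterns:

```
// Option 1: in each step, all 32 lanes atomic-add to the SAME dest[i].
__device__ void same_address_pattern(const int vals[32], int *dest)
{
    #pragma unroll
    for (int i = 0; i < 32; ++i)
        atomicAdd(&dest[i], vals[i]);  // 32-way intra-warp collision per step
}

// Option 2: in each step, the 32 lanes atomic-add to 32 DISTINCT dest
// entries, rotated so that no two lanes ever collide.
__device__ void skewed_pattern(const int vals[32], int *dest)
{
    const int t = threadIdx.x & 31;    // lane index within the warp
    for (int k = 0; k < 32; ++k) {
        const int i = (t + k) & 31;    // (t + k) mod 32
        atomicAdd(&dest[i], vals[i]);  // no intra-warp collisions per step
    }
}
```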

Thoughts:

  • It's possible for a GPU to have hardware which bunches together multiple atomic ops from the same warp to a single destination, so that they count as just one, or at least can be scheduled together; all threads would then proceed to the next instruction at the same time, rather than waiting for the last atomic op to conclude after all the rest are done. (A software analogue of this idea is sketched below.)
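
Such bunching can also be emulated in software: reduce the 32 values across the warp with shuffle intrinsics, then let a single lane issue one atomic. A minimal sketch, assuming a full, converged warp (`__shfl_down_sync` is the CUDA 9+ spelling; older toolkits use `__shfl_down`):

```
// Software analogue of the hypothesized aggregation: five shuffle steps
// collapse the 32 per-lane values into lane 0, which then issues ONE
// atomicAdd instead of 32 colliding ones.
__device__ void warp_aggregated_add(int v, int *dest)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xFFFFFFFF, v, offset);
    if ((threadIdx.x & 31) == 0)  // lane 0 now holds the full warp sum
        atomicAdd(dest, v);
}
```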

Notes:

  • This question focuses on NVIDIA hardware with CUDA, but I'd appreciate answers regarding AMD and other GPUs.
  • Never mind how the threads obtain the values. Assume they're in registers with no spilling, or that they're the result of some in-register arithmetic; forget about any memory accesses needed to get them.
einpoklum
  • Though I don't know the exact answer, my guess is that skewed access is preferable. At least you will be issuing 32 atomic operations per cycle (in the ideal case), but on real hardware your algorithm may suffer from a poor memory access pattern. – Roman Arzumanyan May 30 '14 at 15:38
  • This sounds like something that could only be answered with a lot of careful benchmarking. What's stopping you trying it? – talonmies May 30 '14 at 16:00
  • @talonmies: 1. I only have one (or maybe two) cards with which to benchmark. 2. The answer might be "it probably depends on other parameters", in which case I'll take them into account. 3. For the actual problem I'm working on, I'm trying to reduce the use of atomics rather than optimize them. – einpoklum May 30 '14 at 16:04
  • I would go with the 2nd. The 1st access pattern has the same address in each cycle for all the lanes, which can especially kill performance on pre-Maxwell CUDA cards when shared memory is the destination (see [here](http://stackoverflow.com/a/19548723/2386951)). Also please have a look [here](http://stackoverflow.com/q/22367238/2386951) and [here](http://stackoverflow.com/q/23744985/2386951). – Farzad May 30 '14 at 16:09

1 Answer


This is how I understand your problem:

You have a 32x32 matrix of int values:

    val_{0,0},  val_{1,0},  ..., val_{31,0}
    val_{0,1},  val_{1,1},  ..., val_{31,1}
    ...
    val_{0,31}, val_{1,31}, ..., val_{31,31}

And you need to sum each row:

    val_{0,0} + val_{1,0} + ... + val_{31,0} = dest_0
    val_{0,1} + val_{1,1} + ... + val_{31,1} = dest_1
    etc.

The problem is that each row's values are distributed across different threads. For a single warp, the cleanest approach would be for each thread to share its values through shared memory (in a 32x32 shared-memory array). After a thread sync, thread i sums the i-th row and writes the result to dest_i, which can reside in global or shared memory (depending on your application). This way the computational work (32x31 additions) is divided evenly between the threads in the warp, AND you don't need atomic ops (performance killers) at all. In my experience, atomic operations usually can and should be avoided by distributing the work differently between the threads.
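
A minimal sketch of that idea, assuming a single-warp block; `vals` and `dest` are illustrative names for the thread's 32 values and the 32 destinations, and the extra padding column keeps the row reads free of shared-memory bank conflicts:

```
// Transpose through shared memory: each thread publishes its 32 values
// as a column, then each thread sums one row and writes its dest once.
__device__ void row_sums_without_atomics(const int vals[32], int *dest)
{
    __shared__ int mat[32][33];   // 33 columns pad away bank conflicts

    const int t = threadIdx.x;    // lane index; assumes a 32-thread block

    for (int i = 0; i < 32; ++i)
        mat[i][t] = vals[i];      // thread t fills column t
    __syncthreads();

    int sum = 0;
    for (int j = 0; j < 32; ++j)
        sum += mat[t][j];         // thread t sums row t
    dest[t] = sum;                // a single plain store, no atomics
}
```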

Gil Shapira