
Suppose many warps in a (CUDA kernel grid) block are repeatedly updating a fair number of shared memory locations.

In which of these cases will such work complete faster?

  1. The case of intra-warp access locality, i.e. the total number of memory positions accessed by each warp is small, and most of them are indeed accessed by multiple lanes;
  2. The case of access anti-locality, where all lanes typically access distinct positions (perhaps with an effort to avoid bank conflicts).

And, no less importantly: is this microarchitecture-dependent, or is it essentially the same on all recent NVIDIA microarchitectures?
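For concreteness, here is a minimal sketch of the two patterns I have in mind (kernel names, table size and iteration count are arbitrary):

    // Case 1: intra-warp locality - each warp's lanes hit only a few shared slots.
    __global__ void localized_atomics(int *out)
    {
        __shared__ int table[32];
        if (threadIdx.x < 32) table[threadIdx.x] = 0;
        __syncthreads();

        int lane = threadIdx.x % 32;
        for (int i = 0; i < 1024; i++)
            atomicAdd(&table[lane / 8], 1);   // only 4 distinct addresses per warp

        __syncthreads();
        if (threadIdx.x == 0) *out = table[0]; // keep the result observable
    }

    // Case 2: anti-locality - each lane gets its own slot (and its own bank).
    __global__ void anti_localized_atomics(int *out)
    {
        __shared__ int table[32];
        if (threadIdx.x < 32) table[threadIdx.x] = 0;
        __syncthreads();

        int lane = threadIdx.x % 32;
        for (int i = 0; i < 1024; i++)
            atomicAdd(&table[lane], 1);       // distinct address (and bank) per lane

        __syncthreads();
        if (threadIdx.x == 0) *out = table[0];
    }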

einpoklum

3 Answers


Anti-localized access will be faster.

On SM 5.0 (Maxwell) and later GPUs, for shared memory atomics (say, atomicAdd), the shared memory unit will replay the instruction due to address conflicts (two lanes with the same address). Normal bank-conflict replays also apply. On Maxwell/Pascal the shared memory unit has fixed round-robin access between the two SM partitions (2 schedulers in each partition). For each partition, the shared memory unit will complete all replays of an instruction before moving on to the next instruction. The Volta SM will complete the instruction prior to any other shared memory instruction.

  1. Avoid bank conflicts (e.g. via padding, as sketched after this list)
  2. Avoid address conflicts
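For example (my sketch, not from the answer): a common way to avoid bank conflicts when the lanes of a warp walk down a column of a 2D shared tile is to pad the leading dimension:

    // 32 banks of 4 bytes each on Maxwell and later. Without the "+ 1" padding,
    // a column access tile[threadIdx.x][c] (one element per lane, same column)
    // maps every lane to the same bank and serializes into 32 replays.
    __shared__ float tile[32][32 + 1];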

On the Fermi and Kepler architectures, a shared memory lock operation had to be performed prior to the read-modify-write operation. This blocked all other warp instructions.

Maxwell and newer GPUs have significantly faster shared memory atomic performance than Fermi/Kepler.

A very simple kernel could be written to micro-benchmark your two different cases. The CUDA profilers provide instructions-executed counts and replay counts for shared memory accesses, but they do not differentiate between replays due to atomics and replays due to load/store conflicts or vector accesses.
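For instance, a minimal host-side harness along these lines could time the two cases (illustrative only; it assumes the localized_atomics / anti_localized_atomics kernels sketched in the question are defined in the same file):

    #include <cstdio>

    int main()
    {
        int *d_out;
        cudaMalloc(&d_out, sizeof(int));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        float ms;

        localized_atomics<<<1, 1024>>>(d_out);      // warm-up launch
        cudaEventRecord(start);
        localized_atomics<<<1, 1024>>>(d_out);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms, start, stop);
        printf("localized:      %.3f ms\n", ms);

        anti_localized_atomics<<<1, 1024>>>(d_out); // warm-up launch
        cudaEventRecord(start);
        anti_localized_atomics<<<1, 1024>>>(d_out);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms, start, stop);
        printf("anti-localized: %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_out);
        return 0;
    }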

Greg Smith
  • So, you're saying that it is never the case that shared mem operations from threads in different warps are executed concurrently in the same clock cycle(s), even if they involve disjoint values in shared memory, in different banks? And thus, that anti-local action is always superior? – einpoklum Aug 02 '18 at 06:31
  • Each shared memory unit (1 per SM) can perform 1 operation per cycle (either a read or a write). All addresses and data are from one warp instruction (no merging of instructions from different warps). Instructions may require multiple operations due to access size (>32-bit), bank conflicts, address conflicts (atomic), or because the operation is an atomic requiring a read/op/write. – Greg Smith Aug 02 '18 at 21:02
  • How do you know this to be the case? – einpoklum Aug 02 '18 at 22:17
  • @einpoklum: If Greg says it is so, then it is. – talonmies Aug 03 '18 at 08:41
  • @talonmies: But I seem to recall the latency of shared memory operations is an order-of-magnitude higher than that. Or - is Greg talking about throughput? If that's the case, aren't there races between subsequent operations? – einpoklum Aug 03 '18 at 16:55
  • Throughput of the shared memory unit is 1 read or 1 write operation per cycle. An instruction may require multiple operations due to the data width, bank conflicts, address conflicts, or an atomic operation (multi-op). On Maxwell/Pascal and above the latency is the time in the instruction queue to the shared memory unit + number of operations * 2 cycles (Maxwell/Pascal timeslice operations between even and odd SM partitions) or 1 cycle (Volta doesn't timeslice) + the write back to the register file. Due to the queue depth the latency varies. You'll have to clarify what you mean by "subsequent operations" for a better answer. – Greg Smith Aug 03 '18 at 22:21

There's a quite simple argument to be made even without needing to know anything about how shared memory atomics are implemented in CUDA hardware: at the end of the day, atomic operations must be serialized somehow at some point. This is true in general; it doesn't matter which platform or hardware you're running on. Atomicity kinda requires that simply by nature. If you have multiple atomic operations issued in parallel, you have to somehow execute them in a way that ensures atomicity. That means that atomic operations will always become slower as contention increases, no matter whether we're talking GPU or CPU. The only question is: by how much. That depends on the concrete implementation.

So generally, you want to keep the level of contention, i.e., the number of threads that will be trying to perform atomic operations on the same memory location in parallel, as low as possible…
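For example (my illustration, not part of the original answer): when the per-location update is a plain sum, you can often trade a heavily contended shared memory atomic for private accumulation followed by a single atomic per thread:

    // Contended: every iteration issues an atomic on the same shared slot.
    __global__ void high_contention(int *out, int iters)
    {
        __shared__ int counter;
        if (threadIdx.x == 0) counter = 0;
        __syncthreads();

        for (int i = 0; i < iters; i++)
            atomicAdd(&counter, 1);

        __syncthreads();
        if (threadIdx.x == 0) *out = counter;
    }

    // Lower contention: accumulate privately, then issue one atomic at the end.
    __global__ void low_contention(int *out, int iters)
    {
        __shared__ int counter;
        if (threadIdx.x == 0) counter = 0;
        __syncthreads();

        int local = 0;
        for (int i = 0; i < iters; i++)
            local += 1;
        atomicAdd(&counter, local);

        __syncthreads();
        if (threadIdx.x == 0) *out = counter;
    }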

Michael Kenzel

This is a speculative partial answer.

Consider the related question "Performance of atomic operations on shared memory", and its accepted answer.

If the accepted answer there is correct (and continues to be correct even today), then warp lanes with a more localized access pattern would get in each other's way, making it slower for many lanes to operate atomically, i.e. making anti-locality of warp atomics better.

But to be honest - I'm not sure I completely buy into this line of argumentation, nor do I know if things have changed since that answer was written.

einpoklum