
I'm developing a CUDA kernel to compute the histogram of an image (on an NVIDIA GTX 480). Using the CUDA profiler, I've noticed 82.2% branch divergence. The profiler points to the following function as the source of the divergence; it is located in a file named device_functions.h (in particular, the line containing the return statement).

static __forceinline__
unsigned int __uAtomicAdd(unsigned int *p, unsigned int val)
{
  return __nvvm_atom_add_gen_i((volatile int *)p, (int)val);
}

Is it correct to say that atomic operations cause branch divergence?
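For context, a minimal kernel of the kind described might look like the following sketch (hypothetical and simplified; 256 bins and an 8-bit grayscale image are assumptions, not my actual code):

__global__ void histogram256(const unsigned char *image, unsigned int *hist, int n)
{
  // per-block histogram in shared memory, updated with shared-memory atomics
  __shared__ unsigned int smem[256];
  for (int i = threadIdx.x; i < 256; i += blockDim.x)
    smem[i] = 0;
  __syncthreads();

  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
    atomicAdd(&smem[image[i]], 1u);   // shared-memory atomic
  __syncthreads();

  // merge the per-block histogram into the global result with global atomics
  for (int i = threadIdx.x; i < 256; i += blockDim.x)
    atomicAdd(&hist[i], smem[i]);     // global-memory atomic
}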

Stefano Sandonà
  • Is this particular usage involving shared memory atomics, or global memory atomics, or both? It is straightforward to explain the reason why shared memory atomics will involve branch divergence. I'm not sure about the global memory case, but it may be that a replay mechanism or similar might cause the profiler to view a global atomic as a divergent case. – Robert Crovella Feb 02 '16 at 15:04
  • I use both shared and global memory atomics. – Stefano Sandonà Feb 02 '16 at 15:37

1 Answer


To some degree, atomic implementations in CUDA may vary by GPU architecture. But specifically on the GTX 480 (a Fermi-class GPU), __shared__ memory atomics are implemented not as a single machine instruction, but rather as a sequence of machine (SASS) instructions that form a loop.

This loop is essentially contending for a lock. When the lock is acquired by a particular thread, that thread will then complete the requested memory operation atomically on the identified shared memory cell, and then release the lock.

The process of looping to acquire the lock necessarily involves branch divergence. The possibility for branch divergence in this case is not evident from the C/C++ source code, but will be evident if you inspect the SASS code.
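The effect can be illustrated with a CUDA-level analogy (this is only an analogy, not the actual SASS emitted for shared atomics): a compare-and-swap retry loop like the sketch below has the same character, in that when several threads in a warp contend for the same location, only one succeeds per iteration and the losers take the backward branch and try again, which is precisely branch divergence.

__device__ unsigned int emulatedAtomicAdd(unsigned int *p, unsigned int val)
{
  unsigned int old = *p, assumed;
  do {
    assumed = old;
    // only one contending thread finds its "assumed" value still in place;
    // the others get a different value back and loop again
    old = atomicCAS(p, assumed, assumed + val);
  } while (old != assumed);
  return old;
}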

Global atomics are generally implemented as a single SASS instruction (ATOM or RED). However, global atomics may still involve serialization of access if executed by multiple threads in the same warp. I wouldn't normally think of this as a case of "divergence", but I'm not entirely sure how the profiler would report it. If you ran an experiment that involved only global atomics, I think it would become clear.
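One way to run that experiment is a variant of the histogram kernel that skips the shared-memory stage entirely and uses only global atomics, along the lines of this sketch (256 bins and an 8-bit image assumed):

__global__ void histogram256_global_only(const unsigned char *image, unsigned int *hist, int n)
{
  // every update goes straight to global memory with a single atomic,
  // so any divergence the profiler still reports cannot come from a
  // shared-memory lock loop
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
    atomicAdd(&hist[image[i]], 1u);
}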

It's possible that the reported divergence in your case is entirely due to the shared memory divergence (which is expected) as described above.

Robert Crovella
  • Thanks @Robert Crovella. You are right: implementing the program with only _global_ memory atomics, the profiler reports a branch_efficiency of 100%. So this is an issue with _shared_ memory atomics only. – Stefano Sandonà Feb 03 '16 at 10:44