
I've seen many questions scattered across the Internet about branch divergence, and how to avoid it. However, even after reading dozens of articles on how CUDA works, I can't seem to see how avoiding branch divergence helps in most cases. Before anyone jumps on me with claws outstretched, allow me to describe what I consider to be "most cases".

It seems to me that most instances of branch divergence involve a number of truly distinct blocks of code. For example, we have the following scenario:

if (A):
  foo(A)
else:
  bar(B)

If we have two threads that encounter this divergence, thread 1 will execute first, taking path A. Following this, thread 2 will take path B. In order to remove the divergence, we might change the block above to read like this:

foo(A)
bar(B)

Assuming it is safe to call foo(A) on thread 2 and bar(B) on thread 1, one might expect performance to improve. However, here's the way I see it:

In the first case, threads 1 and 2 execute in serial. Call this two clock cycles.

In the second case, threads 1 and 2 execute foo(A) in parallel, then execute bar(B) in parallel. This still looks to me like two clock cycles; the difference is that in the former case, if foo(A) involves a read from memory, I imagine thread 2 can begin execution during that latency, which results in latency hiding. If this is the case, the branch-divergent code is faster.
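
For concreteness, here is a minimal CUDA sketch of the two versions I'm comparing (foo, bar, and the cond array are placeholders I've made up for illustration, not real code):

__device__ void foo(float *a, int i) { a[i] *= 2.0f; }  // placeholder work
__device__ void bar(float *b, int i) { b[i] += 1.0f; }  // placeholder work

// Version 1: divergent. Threads of one warp can take different paths, so
// the warp runs the two paths one after the other with inactive threads
// masked off.
__global__ void divergent(float *a, float *b, const int *cond)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (cond[i])
        foo(a, i);
    else
        bar(b, i);
}

// Version 2: no divergence, but every thread executes both functions.
__global__ void nonDivergent(float *a, float *b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    foo(a, i);
    bar(b, i);
}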

longbowrocks
    The short answer is that branching is only problematic when all threads in a warp don't follow the same path through a branch. When that happens, you get instruction replay, which reduces instruction throughput and performance. *But* the compiler is smart enough to translate "minor" branching into conditional execution, which is basically penalty free. So this is normally a vastly overstated problem. – talonmies Jun 20 '13 at 21:38
  • @talonmies I don't know some of those terms. Instruction replay? I'd assume that means executing the same instruction in serial on different threads in the same warp, but the only [definition](http://stackoverflow.com/questions/7187489/why-does-cuda-profiler-indicate-replayed-instructions-82-global-replay-lo) I found seems to suggest that instruction replay is caused by **not** executing all instructions sent from the host, in which case I'm not sure what you mean. As for conditional execution, the only definition I'm aware of for that is synonymous with branch divergence. – longbowrocks Jun 20 '13 at 22:12
  • Branch divergence has different levels of overhead. In many cases the compiler can use predication to eliminate the branch or change to the warp active mask. In these cases you will only be impacted by the instruction execution latency to do the condition test and set the predicate. If the code actually diverges, then the question is how many additional instructions were executed to execute multiple code paths. The branch itself and bookkeeping for divergence can add overhead and cause the warp to stall waiting for the branch address resolution and fetch. This can be hidden by higher occupancy. – Greg Smith Jun 21 '13 at 00:10
  • I just want to point out a possible mistake in your question. You mentioned that "the difference is that in the former case, if foo(A) involves a read from memory, I imagine thread 2 can begin execution during that latency", but I think this won't happen because thread 1 and thread 2 are in the same warp, and that means they always execute the same instruction (although one of them may be masked while the other one is not). – night Jul 14 '13 at 12:19
  • There is an interesting paper on reducing branch divergence: http://www.eecis.udel.edu/~cavazos/cisc879/papers/a3-han.pdf. It nicely explains what kinds of optimization are available when the two branches are not truly distinct. I don't believe that true distinctness happens that often. – bcmpinc May 01 '14 at 12:27
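
The comments above mention predication and conditional execution; as a rough sketch of what that means (my own hypothetical example, not taken from the discussion), a branch whose body is a single simple assignment is typically compiled into a predicated or select instruction, so no divergent control flow is emitted:

// Hypothetical kernel: the inner if-body is one simple assignment, so the
// compiler will typically emit predicated execution or a select rather than
// a real branch, and the warp never actually diverges on it.
__global__ void clampNegative(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i];
        if (v < 0.0f)   // "minor" branch
            v = 0.0f;   // usually becomes predication, not divergence
        out[i] = v;
    }
}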

1 Answer


You're assuming (at least it's the example you give and the only reference you make) that the only way to avoid branch divergence is to allow all threads to execute all the code.

In that case I agree there's not much difference.

But avoiding branch divergence probably has more to do with algorithm re-structuring at a higher level than just the addition or removal of some if statements and making code "safe" to execute in all threads.

I'll offer up one example. Suppose I know that odd threads will need to handle the blue component of a pixel and even threads will need to handle the green component:

#define N 2 // number of pixel components
#define BLUE 0
#define GREEN 1
// pixel order: px0BL px0GR px1BL px1GR ...


if (threadIdx.x & 1)  foo(pixel(N*threadIdx.x+BLUE));
else                  bar(pixel(N*threadIdx.x+GREEN));

This means that every alternate thread is taking a given path, whether it be foo or bar. So now my warp takes twice as long to execute.

However, if I rearrange my pixel data so that the color components are contiguous, perhaps in chunks of 32 pixels: BL0 BL1 BL2 ... GR0 GR1 GR2 ...

I can write similar code:

if (threadIdx.x & 32)  bar(pixel(threadIdx.x));
else                   foo(pixel(threadIdx.x));

It still looks like I have the possibility for divergence. But since the divergence happens on warp boundaries, a given warp executes either the if path or the else path, so no actual divergence occurs.

This is a trivial example, and probably stupid, but it illustrates that there may be ways to work around warp divergence that don't involve running all the code of all the divergent paths.
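
As a rough, self-contained sketch of that second arrangement (the kernel name, the float component array, and the foo/bar bodies below are assumptions made for illustration, not part of the example above):

// Hypothetical planar layout from the example above: components stored in
// 32-element chunks, e.g. BL0..BL31 GR0..GR31 BL32..BL63 GR32..GR63 ...
__device__ void foo(float *px) { *px *= 0.5f; }   // placeholder "blue" work
__device__ void bar(float *px) { *px += 0.1f; }   // placeholder "green" work

__global__ void warpAligned(float *components)
{
    int i = threadIdx.x;  // index within this block's slice (grid-level indexing omitted for brevity)
    // Bit 5 of the thread index flips every 32 threads, i.e. exactly on a
    // warp boundary, so all 32 threads of a warp take the same side.
    if (threadIdx.x & 32)
        bar(&components[i]);   // this warp's chunk holds green components
    else
        foo(&components[i]);   // this warp's chunk holds blue components
}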

Robert Crovella
  • So basically what you're saying is not that branch divergence is bad in general, but rather that it should be eliminated when an opportunity to exploit the SIMD architecture of a GPU (as in your example above) is available? – longbowrocks Jun 20 '13 at 21:58
  • Yes. Branch divergence is bad (to some degree) anywhere. CPU manufacturers have put a large amount of effort into CPU speculative execution and branch prediction to address the negative effects of branch divergence. On the GPU, the effect of it is pretty clear. If you can avoid it by clever re-architecting of your algorithm, that's good. But if not, the machine at least gives you the flexibility to code that way, so the programmer's job is more tolerable. The fact that the machine can handle branches in code is a benefit over some more rigid architectures. – Robert Crovella Jun 20 '13 at 22:59
  • Ok, you're getting into branch prediction, which I assume means your post above covers the GPU-specific concerns with regard to branch divergence, and that I've understood you correctly. Thanks a bunch! – longbowrocks Jun 20 '13 at 23:11
  • I'm suggesting that branching is costly (whether CPU or GPU). CPU manufacturers have addressed it to some degree with certain methods (e.g. branch prediction, speculative execution). On the GPU side, there is no branch prediction or speculative execution. But regardless, divergence costs something whether in GPU code or CPU code. – Robert Crovella Jun 20 '13 at 23:17
  • Ah, just read up on instruction pipelining. I see what you mean now. Thank you for not leaving me to wallow in my ignorance. :D – longbowrocks Jun 21 '13 at 00:48