
Assume a CUDA kernel executed by a single warp (for simplicity) reaches an if-else statement, where 20 of the threads within the warp satisfy the condition and the remaining 32 - 20 = 12 threads do not:

if (condition) {
    statement1;     // executed by 20 threads
} else {
    statement2;     // executed by 12 threads
}

According to the CUDA C Programming Guide:

A warp executes one common instruction at a time [...] if threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path.

And therefore the two statements would be executed sequentially in separate cycles.

The Kepler architecture contains 2 instruction dispatch units per warp scheduler, and therefore has the ability to issue 2 independent instructions per warp at each cycle.

My question is: in this setting with only two branches, why could statement1 and statement2 not be issued by the two instruction dispatch units for simultaneous execution by the 32 threads within the warp, i.e. 20 threads execute statement1 while the other 12 simultaneously execute statement2? If the instruction scheduler is not the reason why a warp executes a single common instruction at a time, what is? Is it the instruction set, which only provides 32-thread-wide instructions? Or is there a hardware-related reason?

talonmies
lodhb

1 Answer


Every kernel instruction is always executed by all of the threads within a warp. It is therefore logically not possible to carry out different instructions on different threads within the same warp at the same time; this would go against the SIMT execution model upon which GPUs are built. To your question:

The Kepler architecture contains 2 instruction dispatch units per warp scheduler, and therefore has the ability to issue 2 independent instructions per warp at each cycle.

...

why could statement1 and statement2 not be issued by the two instruction dispatch units for simultaneous execution by the 32 threads within the warp, i.e. 20 threads execute statement1 while the 12 others simultaneously execute statement2?

I am not sure whether you realize this, but if statement1 and statement2 are computationally independent, then they can be executed in one cycle:

  1. The instruction from statement1 will be carried out on all threads,
  2. The instruction from statement2 will also be carried out on all threads within the same cycle, since it can be dispatched by the second dispatch unit.

That's how branch divergence works in GPUs in general; some further reading can be found e.g. here. As a result, I believe you are already getting what you are asking for, for free - both statements are executed within the same cycle (or can be).

EDIT:

As talonmies stated in the comment, it may be worth mentioning conditional execution, as it sometimes helps to prevent the penalty from branch divergence. More on this topic can be found e.g. in this SO thread, quoting:

For simpler conditionals, NVIDIA GPUs support conditional evaluation at the ALU, which causes no divergence, and for conditionals where the whole warp follows the same path, there is also obviously no penalty.

Michal Hosala
    Probably worth mentioning that the hardware supports conditionally predicated instructions and the compiler will generally prefer conditional execution over branch divergence for short code sections. – talonmies Apr 27 '15 at 13:53
  • If I understand you correctly, the two statements would be executed within the same instruction cycle, but not concurrently due to the SIMD/SIMT architecture of the GPU? To reformulate then, my question would be why the GPU cannot violate SIMD to have two different statements executed concurrently by threads within a warp. Or is SIMD a hardware feature of GPUs, i.e. 32 cores executing threads within a warp cannot physically execute different instructions concurrently? – lodhb May 17 '15 at 12:54
  • Regarding conditional execution vs. branch divergence: it's unclear to me what "simpler conditionals" lead to conditional evaluation at the ALU with therefore no branch divergence penalty. How can the programmer influence this? It would seem to me that conditional execution is always preferable to branch divergence. – lodhb May 17 '15 at 12:59
  • @lodhb did you read the linked documents? Because it seems to me like you are using the terms SIMD and SIMT interchangeably, while that is not true. SIMD has basically nothing to do with GPUs. I believe you would understand my answer once you get a good understanding of SIMT and branch divergence in CUDA. As for your later comment - not always preferable, but definitely preferable in situations where the condition evaluates to the same value within each thread belonging to a particular warp. – Michal Hosala May 17 '15 at 16:06
  • I do believe I understand the difference between SIMD and SIMT. [NVIDIA themselves](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#axzz3aPo6sXpQ) state that the two are very similar, I did not feel it was important in my comment to distinguish them. My question is: why can `statement1` and `statement2` not be executed *simultaneously*, not only during the same instruction cycle but actually at the same time? Is SIMT imposed at the hardware level such that the threads cannot execute different instructions concurrently? Regarding conditional execution: when is it not preferable? – lodhb May 17 '15 at 17:00
  • @lodhb "why can statement1 and statement2 not be executed simultaneously" -- that is not true, where in my answer am I stating that? I am saying that "if statement1 and statement2 are computationally independent, then they can be executed in one cycle". So in the same cycle all of the 32 threads within a single warp _can_ execute statement1 and statement2 and all of them _will_ execute both these statements if there is at least one thread where condition is true (for statement1) or false (for statement2). – Michal Hosala May 17 '15 at 17:25
  • *"Therefore it is logically not possible to carry out different instructions on different threads within the same warp at the same time. [...] both statements are executed within the same cycle"* Unless I am misunderstanding you, you are saying that they can execute within the same instruction cycle, but not at the same time (i.e. they are sequentially executed one after the other by the 32 threads within the same instruction cycle). – lodhb May 17 '15 at 17:28
  • @lodhb You are not reading carefully, nor understanding correctly. I'll try once more, but you should really get familiar with SIMT and branch divergence, because otherwise my comments are pointless. Also, the page you've linked clearly states various **fundamental** differences between SIMD and SIMT; it definitely does not say that the two are _very similar_ as you claim, section 4.1 SIMT architecture. More in the upcoming comment. – Michal Hosala May 17 '15 at 17:35
  • To the part of my answer you are quoting - **different instructions** - however, statement1 is executed **by all threads in a warp regardless of the condition** and **the same is true for statement2**, as long as there is at least one thread which will need to execute statement1 or statement2 respectively - this is how branch divergence works in SIMT. Therefore, if part of the threads have the condition true and part false (within the same warp), then first all 32 threads execute statement1 and then all of them execute statement2. And it _can_ happen simultaneously because ... – Michal Hosala May 17 '15 at 17:37
  • "The Kepler architecture contains 2 instruction dispatch units per warp scheduler, and therefore has the ability to issue 2 independent instructions per warp to be at each cycle." HOWGH. – Michal Hosala May 17 '15 at 17:37
  • I had already understood how branch divergence works on SIMT thanks to your original answer. What I think we are not understanding each other on is the *simultaneous* part, and you said it yourself: *"if part of the threads have the condition true and part false (within the same warp), then first all 32 threads execute statement1 **and then** all of them execute statement2."* The 2 instruction dispatch units can *issue* the two instructions simultaneously, but the 32 CUDA cores cannot *execute* the two statements simultaneously - right? – lodhb May 17 '15 at 18:25
  • 1
    @lodhb Right, but the second instruction can be, I believe, dispatched for execution to completely different group of 32 physical cores. Therefore execution of statement1 and statement2 can happen simultaneously but it will require 64 physical cores. – Michal Hosala May 17 '15 at 18:30
  • @lodhb a nice picture can be found for example in the [gk110 whitepaper](http://www.nvidia.com/content/pdf/kepler/nvidia-kepler-gk110-architecture-whitepaper.pdf) at page 10. – Michal Hosala May 17 '15 at 18:34
  • Aha, now this is interesting! Does this mean that a warp (32 threads) can be "spread over" more than 32 cores at a time? This would then be instruction-level parallelism, right? Do you have a source for this? – lodhb May 17 '15 at 18:39
  • If this is true, this would mean that the 32 threads all execute two instructions on two different cores simultaneously? Doesn't this violate the condition that ["a warp executes one common instruction at a time"](http://docs.nvidia.com/cuda/cuda-c-programming-guide/#simt-architecture)? – lodhb May 17 '15 at 18:53
  • @lodhb See for example [here](http://stackoverflow.com/a/18070261/3242721), in particular: "With most newer GPUs, you can certainly get improved performance through instruction level parallelism, by having your thread code have multiple independent instructions in sequence." But I don't have any official document saying that; I was always taking it as a fact. Anyway, I believe we are way beyond the scope of the original question :) So please feel free to start a new SO question if in doubt. – Michal Hosala May 17 '15 at 19:01