Assume a CUDA kernel executed by a single warp (for simplicity) reaches an if-else statement, where 20 of the threads within the warp satisfy condition and 32 - 20 = 12 threads do not:
if (condition) {
    statement1; // executed by 20 threads
} else {
    statement2; // executed by 12 threads
}
According to the CUDA C Programming Guide:
A warp executes one common instruction at a time [...] if threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path.
Therefore, the two statements would be executed sequentially, in separate cycles.
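To make my reading of that passage concrete, here is a rough host-side C++ sketch of the behaviour as I understand it. It is only a toy model, not a description of the actual hardware: the names `issue` and `run_if_else`, the bitmask representation of active lanes, and the two stand-in operations (`x + 1` for statement1, `x * 2` for statement2) are all my own assumptions for illustration.

```cpp
#include <array>
#include <cstdint>
#include <functional>

constexpr int kWarpSize = 32;
using Lanes = std::array<int, kWarpSize>;  // one hypothetical register per lane

// Issue ONE instruction for the whole warp. Only lanes whose bit is set in
// active_mask apply it; the other lanes sit idle for this issue step.
// Returns 1 if something was issued, 0 if no lane was on this path.
inline int issue(Lanes& regs, std::uint32_t active_mask,
                 const std::function<int(int)>& op) {
    if (active_mask == 0) return 0;  // path not taken by any lane: skipped
    for (int lane = 0; lane < kWarpSize; ++lane) {
        if (active_mask & (1u << lane)) regs[lane] = op(regs[lane]);
    }
    return 1;
}

// Model of the if/else above: the two paths are issued one after the other,
// each with the complementary set of lanes disabled.
// Returns the number of issue steps the warp needed.
inline int run_if_else(Lanes& regs, std::uint32_t taken_mask) {
    int steps = 0;
    steps += issue(regs, taken_mask,  [](int x) { return x + 1; });  // statement1
    steps += issue(regs, ~taken_mask, [](int x) { return x * 2; });  // statement2
    return steps;
}
```

In this model, calling `run_if_else` with `taken_mask = 0x000FFFFF` (lanes 0 to 19 satisfy the condition) takes two issue steps, whereas a warp in which all 32 lanes agree takes only one. My question below is essentially why the hardware cannot collapse those two steps into one.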
The Kepler architecture contains 2 instruction dispatch units per warp scheduler, and can therefore issue 2 independent instructions per warp in each cycle.
My question is: in this setting with only two branches, why could statement1 and statement2 not be issued by the two instruction dispatch units for simultaneous execution by the 32 threads within the warp, i.e. 20 threads execute statement1 while the 12 others simultaneously execute statement2? If the instruction scheduler is not the reason why a warp executes a single common instruction at a time, what is? Is it the instruction set that only provides 32-thread wide instructions? Or a hardware-related reason?