8

If you have work-items executing in a wavefront and there is a conditional such as:

  if (x) {
      ...
  }
  else {
      ...
  }

What do the work-items execute? Is it the case that all work-items in the wavefront execute the first branch (i.e. where x is true), and that if there are no work-items for which x is false, the rest of the conditional is skipped?

What happens if one work-item takes the alternative path? Am I right in thinking that all work-items will execute the alternative path as well (therefore executing both paths)? Why is this the case, and how does it not break program execution?

Ciro Santilli OurBigBook.com
Roger

1 Answer

15

NVIDIA GPUs use conditional (predicated) execution to handle branch divergence within the SIMD group (the "warp"). In your if..else example, both branches get executed by every thread in the diverging warp, but those threads which don't follow a given branch are masked off and perform a null op instead. This is the classic branch divergence penalty: intra-warp branch divergence takes two passes through the code section to retire for the warp. This isn't ideal, which is why performance-oriented code tries to minimize it. One thing which often catches people out is making an assumption about which section of a divergent path gets executed "first". There have been some very subtle bugs caused by second-guessing the internal order of execution within a divergent warp.

For simpler conditionals, NVIDIA GPUs support conditional evaluation at the ALU (predication), which causes no divergence, and for conditionals where the whole warp follows the same path there is also, obviously, no penalty.

Michal Hosala
talonmies
  • Ah I see, that makes sense if they are given a null op. I did wonder as well about which path gets executed first as well. If all work items don't take a branch, will they all execute null ops for the remaining path, or will the remaining path be skipped somehow? – Roger May 05 '11 at 12:56
  • 2
    In fact the NVIDIA approach uses an execution mask for each warp and that determines which threads execute. But the effect is that ALUs scheduled with a masked thread do the equivalent of a NOP. The actual order of execution on NVIDIA cards is undefined, but some clever microbenchmarking has shown that the "else" section of your example executes before the "if" section on current hardware. This catches out a lot of naively designed critical section and spinlocks built with atomic memory transactions.... – talonmies May 05 '11 at 13:00
  • Thanks. One more question I have just thought of. I have noticed some GPUs don't have branch predictors. Why wouldn't they use branch predictors to try and eliminate the need for executing both paths (and save time) – Roger May 05 '11 at 20:13
  • 3
Remember that these are effectively SIMD or vector machines, so you have one instruction issue unit feeding multiple ALUs. The premise is that for the typical compute and rendering workloads, which are not very "branchy", this is the best use of the transistor budget. Adding things like branch prediction takes transistors away from something else. The parts of a modern CPU which do computation are vanishingly small, and the instruction handling and cache dominate the die area. So there are trade-offs; GPUs take a different path from general-purpose CPUs – talonmies May 05 '11 at 20:43
  • @talonmies Would you please elaborate on what is a "null operation" in CUDA? Are all threads still executing the same instructions but for disabled threads any memory-related operations are disabled? Or the disabled threads will execute one NOP operation? Thanks! – biubiuty Sep 24 '12 at 18:15