How does work-item branch divergence work in OpenCL?

Question

I'm studying OpenCL and I don't fully understand the concept of "work-item divergence", also called "divergent control flow".

As we can see in the picture below, the threads of a warp or wavefront (the name depends on the GPU model) execute either one instruction or another.

Now, my question is: does the whole warp/wavefront execute the if branch and then the else branch, or does it execute only one of them (only the if or only the else), as in the normal control flow of a program?

This question may be very stupid, but I didn't find anything on the web, and I couldn't understand the point from other material.

Thanks in advance and if there are any problems let me know in the comments!

I'm new to asking my own questions on Stack Overflow :(

As you can see in your graphic, the vertical axis is time and the horizontal axis shows the different threads. First the whole warp/wavefront executes the if branch, then the whole warp/wavefront executes the else branch. The threads to which a branch does not apply are inactive during it, but occupy computation resources nevertheless. The details differ by architecture: the hardware may benefit when only a few threads participate, may skip one of the branches altogether, or may switch between if and else several times. — Sebastian, Jun 29 '22 at 19:27
So, if the if condition is true for some threads of a warp/wavefront, they execute the corresponding branch, and if, for example, the condition is false for one thread, it executes only the else branch, right? — , Jun 29 '22 at 20:45
The instructions that are actually executed (with their results used) are those that would classically be executed. Each instruction that is executed by at least one thread of the warp/wavefront is issued for all of its threads, but with the predicate turned off for the threads that did not take that branch. — Sebastian, Jun 29 '22 at 21:42

score 2 · Answer 1 · answered Jun 29 '22 at 19:31

The key to understanding the GPU-style SIMD execution model is that all threads in a wavefront/SIMD group always execute the exact same instruction at the same time. If a thread doesn't need to run an instruction that at least one other thread must execute, the instruction has no side effects for that thread (its register values won't change, etc.), but it still costs as much in terms of performance as if the thread really did run it.

If the branching condition is either true or false for all threads in a wavefront/SIMD group, then all threads only run the one branch, and the other branch is skipped. So if the condition is the same for almost all threads in your workload, or if you can arrange for the condition to be the same for all threads in a group, then you don't pay the divergence cost. (Or it becomes negligible.)
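As a rough illustration of the rule above, here is a minimal sketch in plain C (not OpenCL, and not any real GPU API) that models how many branch bodies one SIMD group has to step through for a single if/else. The function name `branch_passes` and the boolean array of per-lane conditions are assumptions made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not a real GPU API: returns how many branch bodies
   (0, 1 or 2) a SIMD group steps through for one if/else, given the
   outcome of the condition in each lane (thread) of the group. */
int branch_passes(const bool *cond, size_t lanes)
{
    assert(cond != NULL);

    bool any_true = false;   /* at least one lane wants the if body   */
    bool any_false = false;  /* at least one lane wants the else body */

    for (size_t i = 0; i < lanes; ++i) {
        if (cond[i]) {
            any_true = true;
        } else {
            any_false = true;
        }
    }

    /* Uniform condition: only one body runs, the other is skipped.
       Mixed condition: the whole group steps through both bodies. */
    return (int)any_true + (int)any_false;
}
```

In this model a group where every lane has the same condition value yields 1, and a group with even a single disagreeing lane yields 2, which is the divergence cost described above.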

If the condition diverges within the group, the whole wavefront needs to execute both branches. When this happens, the threads that don't need to run a branch still step through its instructions at exactly the same time as the threads that do need it; for them the instructions simply have no effect. Unlike hardware CPU threads, a GPU thread can't run different code from the other threads in the same SIMD group: it can only run the same code on different data, or it has to wait until the other threads have finished the code it doesn't need to run.
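To make the "stepping through without effect" part concrete, the following is a small sketch in plain C that simulates one SIMD group with an execution mask. It is only a model under stated assumptions: the group size `LANES`, the function `run_group`, and the example branch bodies (halve even values, compute 3x+1 for odd ones) are all hypothetical, and real hardware details differ by architecture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LANES 8  /* hypothetical group size; real groups are larger */

/* Simulates one SIMD group running
       if (x % 2 == 0) x /= 2; else x = 3 * x + 1;
   with an execution mask. Each branch body that at least one lane needs
   is stepped through by ALL lanes; masked-off lanes keep their value.
   Returns the number of lane slots spent, whether active or idle. */
int run_group(int x[LANES])
{
    bool mask[LANES];
    bool any_if = false, any_else = false;
    int slots = 0;

    /* Every lane evaluates the condition. */
    for (size_t i = 0; i < LANES; ++i) {
        mask[i] = (x[i] % 2 == 0);
        if (mask[i]) any_if = true; else any_else = true;
    }

    /* If body: skipped entirely when no lane needs it. */
    if (any_if) {
        for (size_t i = 0; i < LANES; ++i) {
            if (mask[i]) x[i] /= 2;      /* active lane: result is kept */
            ++slots;                     /* idle lanes still cost a slot */
        }
    }

    /* Else body: same idea with the inverted mask. */
    if (any_else) {
        for (size_t i = 0; i < LANES; ++i) {
            if (!mask[i]) x[i] = 3 * x[i] + 1;
            ++slots;
        }
    }

    return slots;
}
```

With eight even inputs the model spends 8 slots, because the else body is skipped. Changing a single input to an odd value makes it spend 16 slots, even though only one lane does useful work in the else body.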

So, as my professor said, if one thread in a warp/wavefront has a false if condition, it will execute the else branch and will be the only one doing so, while the other threads execute the if branch. Is that right, or have I misunderstood? — , Jun 29 '22 at 21:13
Only the thread(s) that *need* to run a particular section of code will actually *use* the results of executing that branch. The other threads in the SIMD group still need to follow along but there is no change to registers or memory. So it is as if they never ran the code. — pmdj, Jun 30 '22 at 05:38
However, the threads that don’t need to run a particular piece of code can’t do anything else while the other threads are running that code. — pmdj, Jun 30 '22 at 08:54
