
This question is about this SO answer (as suggested in the comments there, I opened a new post to ask it). I tried searching the comments for "ALU", but I still can't understand why "conditional evaluation at the ALU" causes no divergence.

Here I quote the original context verbatim (I only added some highlights and replaced the link with the direct SO answer link for better viewing):

More on this topic can be found e.g. in this SO thread, quoting:

For simpler conditionals, NVIDIA GPUs support conditional evaluation at the ALU, which causes no divergence, and for conditionals where the whole warp follows the same path, there is also obviously no penalty.

After reading this valuable paper, my understanding is that SIMT runs the same instructions of a branch path on all related threads in the warp (so some threads in the warp may be idle, i.e. divergent).

Q: Why does "conditional evaluation at the ALU" cause no branch divergence? Does it only apply when the ALUs are scheduled with masked threads, so that all threads' results are masked (i.e. predicated) and therefore no divergence occurs?
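For a concrete picture, here is a minimal sketch of the kind of simple conditional I have in mind (my own example, not from the quoted answer; the kernel name and the compile commands in the comments are illustrative):

```cuda
// A conditional simple enough that the compiler can typically evaluate it
// "at the ALU": the predicate is computed (e.g. FSETP) and the result is
// selected (e.g. SEL/FMNMX) with no branch, hence no divergence.
// Inspect the SASS with, for example:
//   nvcc -arch=sm_70 -cubin clamp.cu && cuobjdump -sass clamp.cubin
__global__ void clamp_negative(const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Both outcomes are cheap, so predication/select beats a branch
        // even when threads within a warp disagree on the condition.
        y[i] = (x[i] < 0.0f) ? 0.0f : x[i];
    }
}
```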

– zg c

1 Answer


Think of conditional execution as an instruction that every thread executes but that has no visible effect on some threads. This means you still incur the cost of running the instruction but you save all additional cost that divergence would have caused.

As for what divergence costs extra, I cite Piotr Bialas, Adam Strzelecki: Benchmarking the cost of thread divergence in CUDA:

Scalar Multiprocessor (SMX) processor maintains for each warp an active mask that indicates which threads in warp are active. When about to execute a potentially diverging instruction (branch) compiler issues one set-synchronization SSY instruction. This instruction causes new synchronization token to be pushed on the top of synchronization stack […] The actual divergence is caused by the predicated branch instruction […]

The instruction may have a pop-bit set, also denoted as synchronization command […] When encountered it signals the stack unwinding. The token is popped from the stack and used to set the active mask and the program counter.

We have estimated the cost of a diverging branch instruction to be exactly 32 cycles on Kepler architecture provided that the maximum stack length did not exceed 16.

So there is at least one more instruction involved in branch divergence plus some form of hardware stack and a whole lot of book-keeping to re-synchronize divergent threads.
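To illustrate (my own sketch, not from the cited paper; the kernel name is hypothetical), here is a conditional whose taken path is expensive enough that the compiler may emit a real branch, which then diverges whenever threads within a warp disagree:

```cuda
#include <math.h>

// A data-dependent branch over an expensive path. Predicating this for
// every thread would waste more cycles than the divergence book-keeping
// (SSY, stack, re-convergence) costs, so the compiler may branch instead.
__global__ void expensive_branch(const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] > 0.0f) {                   // threads may disagree here
            y[i] = sinf(x[i]) * expf(x[i]);  // expensive taken path
        } else {
            y[i] = 0.0f;                     // cheap fall-through path
        }
    }
}
```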

Follow-up questions

Could you give some examples where "conditional evaluation at the ALU" is not used and divergence occurs?

I'd say divergence happens on a predicated branch if not all threads in a warp go the same way (branch taken or not taken).

Simple predication isn't suitable for everything. Divergence is used for loops. You could in theory make a loop without diverging branches by predicating the whole function body and using a warp-vote function to check when to stop the loop; I've not seen the compiler do this, though. You can see the divergent loop branches in the SASS output, for example here on godbolt.
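As a sketch of that idea (my own illustration, untested; the kernel name and constants are hypothetical), using the `__all_sync` warp-vote function so that the loop branch itself stays warp-uniform:

```cuda
// Loop without a diverging exit branch: __all_sync returns the same value
// to every lane, so the while-condition is warp-uniform. Per-thread work
// is guarded by a predicate instead. Assumes full warps (mask 0xffffffff),
// i.e. a block size that is a multiple of 32.
__global__ void warp_uniform_loop(float* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = data[i];
    bool done = false;                       // this thread's stop flag
    while (!__all_sync(0xffffffffu, done)) { // uniform loop branch
        if (!done) {                         // simple, predicable guard
            v *= 0.5f;                       // per-thread work
            done = (v <= 1.0f);              // per-thread termination test
        }
    }
    data[i] = v;
}
```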

You also use divergence to skip over proper function calls. The only alternative would be for the function to accept a predicate parameter.
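A sketch of that alternative (my own example; `scale_if` and its signature are hypothetical): every thread makes the call, and the predicate parameter masks the visible effect inside the callee:

```cuda
// The callee takes a predicate instead of being skipped by a branch:
// all threads execute the call, and only the final store is guarded
// by a condition simple enough for the compiler to predicate.
__device__ void scale_if(float* out, int i, float factor, bool pred)
{
    float v = out[i] * factor;   // every thread does the work
    if (pred) out[i] = v;        // effect masked for pred == false
}

__global__ void apply_scale(float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        scale_if(out, i, 2.0f, (i % 2) == 0);  // no branch around the call
    }
}
```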

It is also used for large code blocks. If we follow the assumption that a predicated instruction still incurs its runtime cost but has no effect, then skipping over a large code block is sensible, especially if there is a chance that all threads go the same way.

Consider it like this: An if-else built with pure predication always costs as much as running both branches, since all threads go through all instructions. The same if-else built with branches where all threads follow the same path has only the cost of that path plus the synchronization overhead.

Note also that you can have predicated branches without divergence and synchronization. If the compiler can prove that divergence cannot happen (e.g. because the condition depends on a value that is equal for all threads), then the compiler will not issue an SSY instruction.
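For example (my own sketch; the kernel and parameter names are hypothetical), a condition that depends only on a kernel parameter is provably the same for every thread, so the branch is warp-uniform:

```cuda
// The branch condition depends only on the kernel parameter `mode`,
// which is identical for all threads, so the compiler can prove that
// the warp never diverges here and omit the SSY/re-convergence machinery.
__global__ void uniform_branch(float* y, int n, int mode)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (mode == 0) {      // same outcome in every thread of a warp
            y[i] = 1.0f;
        } else {
            y[i] = -1.0f;
        }
    }
}
```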

– Homer512
  • Thanks for the reply. 1. Does "support conditional evaluation at the ALU" mean the calculation of a **predicate register**, like `ISETP.LT.AND P0, PT, R5, 0x1, PT;` (i.e. *one* more instruction involved)? So all instructions are run and the results are predicated to produce the right results. If what I stated in this comment is correct, could you give some examples where "conditional evaluation at the ALU" is not used and divergence occurs? – zg c Jul 31 '23 at 01:33
  • @zgc I don't understand what you are saying – Homer512 Jul 31 '23 at 07:04
  • Sorry, maybe I didn't express myself clearly. 1. The original question says "NVIDIA GPUs support **conditional evaluation at the ALU**, which causes no divergence". Does "conditional evaluation at the ALU" refer to the calculation of the predicate register, `ISETP.LT.AND P0, PT, R5, 0x1, PT;`, in some cases (i.e. **one** more instruction involved), as shown in your referenced paper? Then all threads in the warp run the same instruction `@P0 BRA 0x120;` (also shown in a specific example in the paper) based on the predicate register `P0`, which implies no branch divergence. – zg c Jul 31 '23 at 07:47
  • If it's that case, could you give some examples where "conditional evaluation at the ALU" is **not used** and divergence occurs? – zg c Jul 31 '23 at 07:50
  • @zgc I hope the extended answer is sufficient – Homer512 Jul 31 '23 at 08:54
  • Thanks for the detailed addition. However, I still have some small questions. 1. From the "Consider it like this" context, it seems that "conditional evaluation" means "predication", is that the case? 2. Since you said "I'd say divergence happens on a predicated branch if **not all threads** in a warp go the **same** way (branch taken or not taken).", why does the quote in the original question say "conditional evaluation at the ALU, which causes no divergence"? – zg c Jul 31 '23 at 10:07
  • 1
    @zgc I don't know whether "conditional evaluation" has a specific definition. I would use it as a synonym for predication but the other SO answer seems to use it as a more general term that also encompasses the active mask and diverging branches. "divergence" on the other hand has a very specific meaning. It is when a predicated branch is encountered and threads from the same warp have different predication values. – Homer512 Jul 31 '23 at 10:19
  • Thanks for your kind help. I understand now. My original question was about the ambiguity of "conditional evaluation". If it means **predication**, as your answer **assumed**, then everything is reasonable. – zg c Jul 31 '23 at 11:03
  • Could I ask one more small question about what you said: Does "a warp-vote function to check when to stop the loop" mean the same as "majority/minority-vote strategy" in the [paper](https://www.eecis.udel.edu/~cavazos/cisc879/papers/a3-han.pdf) referenced in the question? – zg c Jul 31 '23 at 14:57
  • 1
    @zgc I was thinking of using the [`__all_sync` function](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-vote-functions). That paper seems to do something more advanced but it's calling `__ballot` (now `__ballot_sync`), so it's based on the same functions – Homer512 Jul 31 '23 at 15:55