Think of conditional execution (predication) as an instruction that every thread executes but that has no visible effect on some threads: you still incur the cost of running the instruction, but you avoid all the additional cost that divergence would have caused.
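A minimal sketch of what such a candidate for predication looks like at the source level (kernel and variable names are illustrative):

```cuda
// A short conditional body like this is a typical candidate for predication.
// All 32 threads of a warp execute the assignment; for threads with x >= 0
// it is simply suppressed (or compiled into a select), so no branch is needed.
__global__ void clamp_negative(float *data)
{
    float x = data[threadIdx.x];
    if (x < 0.0f)        // usually becomes a predicated/select instruction,
        x = 0.0f;        // not an actual branch
    data[threadIdx.x] = x;
}
```

Whether the compiler predicates or branches is its decision; for one-instruction bodies like this, predication is the common outcome.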
As for what divergence costs extra, I'll cite Piotr Bialas and Adam Strzelecki, "Benchmarking the cost of thread divergence in CUDA":
Scalar Multiprocessor (SMX) processor maintains for each warp an active mask that indicates which threads in warp are active. When about to execute a potentially diverging instruction (branch) compiler issues one set-synchronization SSY instruction. This instruction causes new synchronization token to be pushed on the top of synchronization stack […] The actual divergence is caused by the predicated branch instruction […] The instruction may have a pop-bit set, also denoted as synchronization command […] When encountered it signals the stack unwinding. The token is popped from the stack and used to set the active mask and the program counter.

We have estimated the cost of a diverging branch instruction to be exactly 32 cycles on Kepler architecture provided that the maximum stack length did not exceed 16.
So there is at least one more instruction involved in branch divergence plus some form of hardware stack and a whole lot of book-keeping to re-synchronize divergent threads.
Follow-up questions
could you give some examples where "conditional evaluation at the ALU" is not used and divergence occurs?
I'd say divergence happens on a predicated branch if not all threads in a warp go the same way (branch taken or not taken).
Simple predication isn't suitable for everything, though. Divergence is used for loops: in theory you could build a loop without diverging branches by predicating the whole loop body and using a warp-vote function to decide when to stop, but I've not seen the compiler do this. You can check what it actually emits in the SASS output, such as here on godbolt.
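For illustration, here is a hand-written sketch of that warp-vote idea (not something I've seen the compiler generate; names are made up, and it assumes the block is a whole warp so the full mask is valid):

```cuda
// Count iterations of a per-thread loop, but exit the loop only when the
// whole warp is done: the do/while condition is uniform across the warp,
// so the loop branch itself never diverges. The body is guarded by a
// per-thread flag that is a candidate for predication.
__global__ void count_halving_steps(const unsigned *in, unsigned *out)
{
    unsigned n = in[threadIdx.x];
    unsigned steps = 0;
    bool active;
    do {
        active = (n > 1);
        if (active) {    // short guarded body: predication candidate
            n /= 2;
            ++steps;
        }
        // __all_sync is true once every lane has finished its work.
    } while (!__all_sync(0xffffffffu, !active));
    out[threadIdx.x] = steps;
}
```

The price is that fast lanes keep executing (suppressed) iterations until the slowest lane in the warp finishes, which is exactly the cost a divergent loop branch would have had anyway.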
You also use divergence to skip over proper function calls. The only alternative would be for the function to accept a predicate parameter.
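Sketched with hypothetical helper names, the two options look like this:

```cuda
// Stand-in for a large, non-inlined device function.
__device__ void heavy_update(float *p) { *p = sqrtf(*p); }

__global__ void skip_with_divergence(float *data)
{
    // Divergent branch: inactive threads skip the entire call.
    if (data[threadIdx.x] > 0.0f)
        heavy_update(&data[threadIdx.x]);
}

// The branch-free alternative: every thread makes the call and passes its
// condition in, so the callee can predicate its own body.
__device__ void heavy_update_pred(float *p, bool enabled)
{
    float r = sqrtf(*p);
    if (enabled)       // candidate for predication inside the callee
        *p = r;
}
```

With the predicate-parameter version, all threads pay for the full function body on every call, which is why skipping the call via divergence is usually the better trade.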
It is also used for large code blocks. If we follow the assumption that a predicated instruction still incurs its runtime cost but has no effect, then skipping over a large code block is sensible, especially if there is a chance that all threads go the same way.
Consider it like this: An if-else built with pure predication always costs as much as running both branches, since all threads go through all instructions. The same if-else built with branches where all threads follow the same path has only the cost of that path plus the synchronization overhead.
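To make that concrete with made-up instruction counts (the ~32 cycles is the divergence overhead measured in the paper quoted above):

```
then-branch: 40 cycles, else-branch: 60 cycles

pure predication:       40 + 60        = 100 cycles, always
branch, warp uniform:   40 or 60, plus small branch/SSY overhead
branch, warp divergent: 40 + 60 + ~32  = ~132 cycles
```

So branching wins whenever warps are usually uniform, and loses only modestly when they are not.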
Note also that you can have predicated branches without divergence and synchronization. If the compiler can prove that divergence cannot happen (e.g. because the condition depends on a value that is equal for all threads), then it will not issue an SSY instruction.
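A sketch of such a provably uniform condition (the flag name is made up; the point is that nothing in the condition depends on threadIdx):

```cuda
__constant__ bool enable_fixup;   // same value for every thread in the grid

__global__ void uniform_branch(float *data)
{
    // The condition depends only on warp-invariant values (a __constant__
    // flag and blockIdx), so all 32 threads of a warp take the same path
    // and the branch can be emitted without an SSY token.
    if (enable_fixup && blockIdx.x > 0)
        data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
}
```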