Recently I took the course of discrete math. The professor tells us that branching is slower than branchless.
AFAIK modern CPUs use pipeline to increase the efficiency, so breaching in CPUs are essentially assuming a result of previous instructions, do the operations base on it until the previous instructions finished then compare the assumption and actual result.
If match then the result based on assumption are accepted, else the operations will be redone using the actual value previous produced.
My questions are:
- The real slow part is when assumption mismatch the product of previous, isn't it ?
1-1. If CPUs always guessed right, then it's actually faster right?
- The following is x == y in x86_64 asm:
2-1. Branching
XeqY1:
xor rax, rax ; clear rax
cmp rdi, rsi ; compare X and Y
sete al ; set al = 1 if equal
2-2. Branchless
XeqY2:
mov rax, rdi ;
sub rax, rsi ;
sub rsi, rdi ;
xor rax, rsi ; tmp = (X-Y)^(Y-X)
not rax ; invert tmp
shr rax, 63 ; get sign bit
2-3. For above two, 2-1 only takes about 3 Cycle. 2-2 takes about 6 cycle, which is doubled of branching version. Does it really works faster? It's weird to me that 2-1 will be slower than 2-2 (according to professor).
EDIT 13:53
Thanks for the replies!!
I didn't know that SETcc
and CMOVcc
was branchless, what if change 2-1 to:
XeqY1:
cmp rdi, rsi
jne .L0
mov rax, 1
.L0:
mov rax, 0
Does this change make it non-branchless? If so, will it be faster or slower to 2-2?