Branchless comparison in x86_64

Question

Recently I took the course of discrete math. The professor tells us that branching is slower than branchless.

AFAIK modern CPUs use pipeline to increase the efficiency, so breaching in CPUs are essentially assuming a result of previous instructions, do the operations base on it until the previous instructions finished then compare the assumption and actual result.

If match then the result based on assumption are accepted, else the operations will be redone using the actual value previous produced.

My questions are:

The real slow part is when assumption mismatch the product of previous, isn't it ?

1-1. If CPUs always guessed right, then it's actually faster right?

The following is x == y in x86_64 asm:

2-1. Branching

XeqY1:
    xor     rax, rax ; clear rax
    cmp     rdi, rsi ; compare X and Y
    sete    al       ; set al = 1 if equal

2-2. Branchless

XeqY2:
    mov     rax, rdi ; 
    sub     rax, rsi ; 
    sub     rsi, rdi ; 
    xor     rax, rsi ; tmp = (X-Y)^(Y-X)
    not     rax      ; invert tmp
    shr     rax, 63  ; get sign bit

2-3. For above two, 2-1 only takes about 3 Cycle. 2-2 takes about 6 cycle, which is doubled of branching version. Does it really works faster? It's weird to me that 2-1 will be slower than 2-2 (according to professor).

EDIT 13:53

Thanks for the replies!!

I didn't know that SETcc and CMOVcc was branchless, what if change 2-1 to:

XeqY1:
    cmp     rdi, rsi
    jne     .L0
    mov     rax, 1
.L0:
    mov     rax, 0

Does this change make it non-branchless? If so, will it be faster or slower to 2-2?

Related: [Which instructions can produce a branch misprediction on x86 CPUs?](https://stackoverflow.com/q/59766432) - only instructions that change RIP, so not `setcc` or `cmovcc`. — Peter Cordes, Dec 24 '21 at 05:23
For 2-2, how do you figure 6 cycles? There's some instruction-level parallelism, and all real x86-64 CPUs are superscalar. It's obviously terrible vs. 2-1, but its latency critical path is at worst 5 cycles, maybe 4 with mov-elimination, for CPUs with 1 cycle latency for everything. (So maybe not Prescott P4 with its slow shifter). And throughput in terms of overlap with independent work, lots of room, only 6 uops, goes through the front-end of Skylake in 1.5 cycles. See [this Q&A](https://stackoverflow.com/q/51607391) — Peter Cordes, Dec 24 '21 at 05:23
There *is* a real effect to talk about, yes branch mispredicts are somewhat costly, but this code doesn't demonstrate it because `sete` isn't a branch. IDK exactly what question you want to ask after finding out that the premise of your example is false, but if you want to edit into something different we could reopen. But see [**Modern Microprocessors A 90-Minute Guide!**](http://www.lighterra.com/papers/modernmicroprocessors/) and [What exactly happens when a skylake CPU mispredicts a branch?](https://stackoverflow.com/q/50984007) first. — Peter Cordes, Dec 24 '21 at 05:28
Thanks for replying! The estimation of the cycle is from a chart i saw a while ago, IDK if the chart is correct or not, it's my fault not including the source but I forgot how did i search it. I thought as long as conditional involved, the branching will occur. I just newly learning assembly, and not yet take the courses about architecture or working principles. Really appreciate you for spending time on the questions! — TNThung, Dec 24 '21 at 05:50
You're right, your professor is over-simplifying. Predicted branches cost almost nothing; "branchless" prevents the CPU from using an assumed/predicted condition and is often worse than a predicted branch; and a mispredicted branch is much worse than "branchless". Also note that on 80x86 "branchless" is very limited and can only be considered for simple expressions (e.g. you can't do something like "`if(x == 0) { close(file); }`"). — Brendan, Dec 24 '21 at 07:08
"Branch" means "conditional jump", where the actual flow of execution may change depending on the condition. In x86, that's precisely the `Jcc` instructions. The instruction after `Jcc` in memory may or may not be executed, depending on the truth of the condition. `SETcc` and `CMOVcc` are not branches, and the instruction following them will be executed no matter what. — Nate Eldredge, Dec 24 '21 at 07:22
Thanks again! So the reason for `SETcc` and `CMOVcc` are branchless is because they'll execute whatever follows behind, regardless of whether the condition is fulfilled, is that correct? — TNThung, Dec 24 '21 at 08:16
@洪明聖 It's because these are not conditional branch instructions. A conditional branch instruction is one that diverts the control flow depending on a condition. `SETcc` and `CMOVcc` do not divert the control flow and are not conditional branches. They are just arithmetic instructions taking the flags for inputs. — fuz, Dec 24 '21 at 09:35
@fuz: I later remembered we had some better duplicates that explain this: [Is CMOVcc considered a branching instruction?](https://stackoverflow.com/q/57524415) and [Why is a conditional move not vulnerable for Branch Prediction Failure?](https://stackoverflow.com/q/14131096) - added those to the list of duplicates. (洪明聖 - you can notify people when you reply to their comments, with @ user like I did for fuz. Your reply didn't show up in my inbox; I just happened to look at this question again to add duplictes.) — Peter Cordes, Dec 24 '21 at 12:32
@PeterCordes Sorry for that, I'm new to using this site. The articles you sourced to, that's exactly what I'm confused about. U R THE BEST!! Thank you!! — TNThung, Dec 25 '21 at 07:53

Branchless comparison in x86_64

0 Answers0