What is it about CMOV which improves CPU pipeline performance?

Question

I understand when a branch is easily predicted its better to use an IF statement because the branch is totally free. I have learnt that if the branch isn't easily predicted, then a CMOV is better. However, I do not quite understand how this could be achieved?

Surely the problem domain is still the same- we do not know the address of the next instruction to execute? So I don't understand how all the way down the pipeline, when the CMOV is executed, how that could have helped the instruction fetcher (10 CPU cycles back in the past) choose the correct path and prevent a pipeline stall?

Could somebody please help me understand how CMOV improves branching?

possible duplicate of [Why is a conditional move not vulnerable for Branch Prediction Failure?](http://stackoverflow.com/questions/14131096/why-is-a-conditional-move-not-vulnerable-for-branch-prediction-failure) — Pascal Cuoq, Nov 25 '14 at 21:32
How do we “not know the address of the next instruction to execute?” If the instruction sequence is `I1; CMOV; I3`, the instructions `I1`, then `CMOV`, then `I3` are executed, always in this order. The one that comes after `I1` is `CMOV`. The one that comes after `CMOV` is `I3`. — Pascal Cuoq, Nov 25 '14 at 21:35
A one-instruction-long branch is a very special case; it lends itself to optimizations that a general purpose branch doesn't admit. — Seva Alekseyev, Nov 25 '14 at 21:35
Pascal, then how is this any different to an IF statement? We know both the start-instructions for both the IF branches.... — user997112, Nov 25 '14 at 21:36
Alternatively, you can think of CMOV as not a branch at all, as a command to the effect of `Dest = (Dest AND NOT Condition) OR (Source AND Condition)` where Source, Dest and Condition are bits. — Seva Alekseyev, Nov 25 '14 at 21:37
Seva's explanation is much better than either of the answers. It's a data dependency instead of a control dependency. It has 3 inputs (dest, src, and flags), and produces one output (dest). For the OOO pipeline, it can be tracked exactly like `adc` (which has the same three inputs and one output). — Peter Cordes, Feb 13 '16 at 18:49

zx485 · Answer 1 · 2016-02-13T07:24:03.460

Could somebody please help me understand how CMOV improves branching?

Well, it does NOT improve branching, it removes it. A CMOV could be seen as two instructions in one, a MOV and a NOP. Which one is executed depends on the flags. So internally it may look like

if (cond) {
    mov dst, src
} else {
    nop
}

...

Surely the problem domain is still the same- we do not know the address of the next instruction to execute?

Well, no. The next instruction is always the one following the CMOV, so the instruction pipeline is not invalidated and reloaded (branch prediction and other optimiziations left aside). It is one continuos flow of macro-opcodes. A simple example is following

if (ecx==5)
    eax = TRUE
else
    eax = FALSE

in basic asm:

cmp ecx,5      ; is ecx==5
jne unequal    ; what is the address of the next instruction? conditional branch
mov eax,TRUE   ; possibility one
jmp fin
unequal:       : possibility two
mov eax,FALSE
fin:
nop

with CMOV

cmp ecx,5
mov eax, FALSE   ; mov doesn't affect flags
mov ebx, TRUE    ; because CMOV doesn't take immediate src operands, use EBX for alternative
cmove eax, ebx   ; executes as MOV if zero-flag is set, otherwise as NOP
nop              ; always the next instruction, no pipeline stall

Is it worth it on current CPUs? A clear YES. From my experience and (of course) depending on the algorithm, the speed gain is significant and worth the effort.

`cmove` doesn't allow for immediate operands. So you need to clobber a register: `mov ebx, TRUE` then `cmove eax, ebx`. — Björn Lindqvist, Jun 23 '15 at 10:37
That's right. Or use a variable like ddTRUE if you have no spare register. — zx485, Jun 24 '15 at 07:04

score 19 · Answer 2 · answered Nov 25 '14 at 22:10

CMOV instructions don't direct the path of control flow. They are instructions that are executed to compute the result based on condition codes, i.e. predicated instructions. Some architectures (like ARM) can predicate many forms of instructions based on condition codes, but x86 can only do "mov", that is, the conditional move (CMOV). These are decoded, and executed with latency in order to determine the result of the instruction.

Branches, on the other hand, are predicted and actually steer the execution of instructions. The branch predictor "looks ahead" of the instruction "fetcher", specifically looking for branch instructions, and predicts the path by steering the flow. Think of a railroad track where a person ahead shifts the tracks either left or right to tell the train where to go. Now if the guy chose the wrong direction, the train has to stop, backup, then move again in the right direction. Lots of time wasted.

CMOVs, on the other hand, don't steer the flow. They are simply instructions take extra time (and create additional dependencies) to figure out the proper result of the move based on condition codes. Think of the train, instead of decided to go left or right, takes a straight path that requires no turn, but is a bit slower (obviously way more complicated, but it's the best I can come up with right now).

CMOVs used to be really bad (very high latency) but have since improved to be pretty fast, making them a lot more usable and performance-worthy.

Hope this helps..

Good answer, although I wish the train example for cmov could have included splitting the carts between the railroads somehow, with one of them eventually plunging to an abyss (obviously there would have to be a rooftop fight scene too) — Leeor, Nov 26 '14 at 08:24
From my testing, short jumps (1-4 instructions) end up being faster in some situations. — BitBank, May 25 '15 at 08:31

What is it about CMOV which improves CPU pipeline performance?

2 Answers2