In assembly, should branchless code use complementary CMOVs?

Question

It's well known that we can use the CMOV instruction to write branchless code, but I was wondering if I'm writing the equivalent of x = cond ? 1 : 2, should I prefer

CMOVE rax, 1    #1a
CMOVNE rax, 2   #1b

or

MOV rax, 1      #2a
CMOVNE rax, 2   #2b

In theory the first one can execute pretty much in parallel while the second one is slower because of a data dependency. But I'm not sure how it is in reality.

The first one also has a data dependency. `CMOVcc rax, ...` has an input dependency on `rax` because it remains unmodified if `cc` is false. — Nate Eldredge, Nov 23 '22 at 20:44
Also, `CMOVcc` has no immediate form, so to use the first one, you would need to load the values 1 and 2 into two different registers (or use one register and reload it in between). — Nate Eldredge, Nov 23 '22 at 20:45
The advantage of the second one is that `MOV rax, 1` does *not* have an input dependency, so it could execute out-of-order long before the condition is known. `CMOVcc` has an additional input dependency on the flags, so in your first version, the first `CMOVE` has to wait for FLAGS to be ready, and the second one has to wait for the first. — Nate Eldredge, Nov 23 '22 at 20:47
Nate is right. The first one cannot execute in parallel, there's clearly a dependency on `rax` on the first instruction of the first snippet. When in a loop, the first snipped executes at 2c per iteration, while the second is 1c per iteration. — Margaret Bloom, Nov 23 '22 at 20:49
Be also aware that some conditions are easy to optimize. While irrelevant to your very interesting question, the example used can be easily optimized [and compilers actually do it](https://godbolt.org/z/74of386vd). — Margaret Bloom, Nov 23 '22 at 20:53

Nate Eldredge · Accepted Answer · 2022-11-23T21:28:30.057

The second one seems better.

First, note that CMOVcc does not have an immediate form; the second operand must also be a register (or memory). And so CMOVcc rax, rbx actually has three input dependencies; rax, rbx, and flags. rax is an input dependency because if cc is false, then the output value in rax must equal the value it had before. So the instruction will always stall until all three of the inputs are ready.

You might imagine a "conditional dependency" where the instruction need not stall on rax if cc is true, but I don't believe any currently existing machine can actually do that. Generally, a conditional move is treated as an arithmetic/logic instruction, kind of similar to adc rax, rbx: it computes some function of rax, rbx and the flags, and leaves the result in rax. You could think of that function as something like rax = (~mask & rax) | (mask & rbx).

(This is one of the main disadvantages of branchless code: it always has to wait for both results to be ready. A branch may seem worse, but if it's correctly predicted, then it only waits on the result that is actually needed. Linus Torvalds wrote a famous rant about this.)

So the first example would look more completely like

mov rbx, 1
mov rcx, 2
cmp whatever
cmove rax, rbx
cmovne rax, rcx

(I know we should be doing it all with 32-bit registers to save the REX prefixes, but it's just an example.)

And we can now see several problems.

The cmovs must wait for rbx and rcx to be ready, but that's probably not an issue; the immediate movs have no input dependency at all, so they could have been executed out-of-order long ago.
More seriously, the second cmov has an input dependency on the first one's output, via rax. So the two cmovs must actually execute serially and not in parallel.
And worse, the first cmov has a false dependency on whatever the previous value of rax was. If some earlier code is doing something slow with rax, this code will stall until that other code is done, even though the value left in rax by the earlier code should be totally irrelevant to this snippet.

The second alternative would look like:

mov rax, 1
mov rbx, 2
cmp whatever
cmovne rax, rbx

This avoids most of those issues. Now the only real input dependency is the flags produced by the cmp (and thus on whatever the inputs to cmp are). As before, the immediate movs to rax and rbx have no input dependencies and can be done in advance. And there is no false dependency on the previous value of rax because mov rax, 1 definitively wipes it out. As a final bonus, this version is one instruction shorter.

Thank you, great answer. I had wondered a bit if the processor would be smart enough to parallelize CMOVE and CMOVNE since they are complementary, but that doesn't seem to be the case. And unnecessary given the other restrictions with immediate values and "unconditional dependency". — Daniel, Nov 23 '22 at 21:21
@Daniel: On ARM, it apparently is common for compilers to generate code with complementary conditions like `movne` / `moveq` (https://godbolt.org/z/6KaYaG8Kj), but ARM predicated execution makes an instruction into a NOP, even if it was a load from a bad address or something. On other ISAs, including x86 cmov or AArch64 csinc (conditional select/increment), you only have a conditional-select ALU operation with 3 inputs, as Nate explained, with the pipeline / scheduling not seeing it as a NOP. Unlike with ARM predicated execution which is visible to the pipeline, which works on any insn. — Peter Cordes, Nov 24 '22 at 04:40

In assembly, should branchless code use complementary CMOVs?

1 Answers1