I know they can only correctly execute after instructions before them in Re-Order Buffer are committed. My question is: do modern processors hold them until they are the oldest instruction in the ROB, or are prediction counters/structures used to predict flag values such as the Zero flag or Carry flag, with the instruction redone if the prediction was wrong?
1 Answer
I know they can only correctly execute after instructions before them in Re-Order Buffer are committed.
No, they only need their own inputs to be ready: those specific previous instructions executed, not retired / committed.
Conditional-move instructions (and ARM predicated execution) treat the flags input as a data dependency, just like add-with-carry, or just like an integer input register. The conditional instruction can't be sent to an execution unit until all 3 of its inputs are ready (see footnote 1). (Or on ARM, flags + however many inputs the predicated instruction normally has.)
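To make the "flags are just another input" point concrete, here is a minimal C sketch that models `cmovcc` and `adc` as pure functions. The names `cmov_model` and `adc_model` are hypothetical and only illustrate the data flow; this is not how any CPU implements them.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of cmovcc: the result depends on three inputs,
 * the condition taken from flags, the old destination, and the source.
 * Nothing is predicted; all three must be known to compute the result. */
static uint64_t cmov_model(bool cond_from_flags, uint64_t dst_old, uint64_t src)
{
    return cond_from_flags ? src : dst_old;
}

/* Hypothetical model of adc: two integer inputs plus the carry flag in,
 * producing a sum and a new carry flag out. */
static uint64_t adc_model(uint64_t a, uint64_t b, bool carry_in, bool *carry_out)
{
    uint64_t sum = a + b;            /* wraps on overflow */
    bool c1 = sum < a;               /* carry out of a + b */
    uint64_t res = sum + carry_in;   /* add the incoming carry */
    bool c2 = res < sum;             /* carry out of adding carry_in */
    *carry_out = c1 || c2;
    return res;
}
```

In this model the flag value is an ordinary argument, which mirrors why the real instruction has to wait for the flag-producing instruction to execute (not retire) before it can run.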
Unlike with control dependencies (branches), they don't predict or speculate what the flags will be, so a `cmovcc` instead of a `jcc` can create a loop-carried dependency chain and end up being worse than a predictable branch. The Q&A "gcc optimization flag -O3 makes code slower than -O2" is an example of that.
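As a rough illustration of the two shapes, here is a C sketch of a conditional sum written with an `if` and with a select. The function names are hypothetical, and whether a compiler actually emits `jcc` for the first and `cmovcc` for the second depends on the compiler, its options, and the target, so check the generated asm rather than assuming.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy shape: if compiled to a conditional jump, a correctly predicted
 * branch lets later iterations proceed without waiting for the compare. */
int64_t sum_ge_branchy(const int32_t *data, size_t n, int32_t threshold)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] >= threshold)
            sum += data[i];
    }
    return sum;
}

/* Select shape: if compiled to a cmov on `sum`, every iteration's result
 * depends on the previous `sum` through both the add and the select,
 * so the select sits on the loop-carried dependency chain. */
int64_t sum_ge_select(const int32_t *data, size_t n, int32_t threshold)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t with = sum + data[i];
        sum = (data[i] >= threshold) ? with : sum;
    }
    return sum;
}
```

Both functions return the same value; the difference is only in which dependencies the hardware has to wait for. With unpredictable data the select shape avoids mispredicts, while with predictable data the extra latency in the chain can make it the slower choice.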
Linus Torvalds explains in more detail why cmov often sucks: https://yarchive.net/comp/linux/cmov.html
(ARM predicated execution might be handled slightly differently. It has to logically NOP the instruction, even for a load or store to an invalid address. This might be handled with just fault suppression for conditional loads. I don't know if an instruction with a false predicate still costs any latency in the dependency chain for the destination register.)
Footnote 1: This is why `cmovcc` and `adc` are 2 uops on Intel before Broadwell: a single uop couldn't have 3 input dependencies. Haswell introduced support for 3-input uops for FMA.
`cmov` instructions that read CF and one of the SPAZO flags (i.e. `cmova` and `cmovbe`, which read CF and ZF) are actually still 2 uops on Skylake. See this Q&A for detail: it seems that those two separately-renamed groups of flags are both separate inputs, avoiding flag-merging. See also https://uops.info/ for uop counts.
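The conditions involved can be written out as a small C model. The struct and function names are hypothetical; the flag definitions follow the x86 meaning of an unsigned `cmp a, b` (CF set when `a < b`, ZF set when `a == b`).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the two flags that matter here. */
typedef struct {
    bool cf;  /* carry flag: unsigned a < b after cmp a, b */
    bool zf;  /* zero flag:  a == b after cmp a, b */
} flags_model;

static flags_model cmp_model(uint32_t a, uint32_t b)
{
    flags_model f = { .cf = a < b, .zf = a == b };
    return f;
}

/* Conditions that need both CF and ZF (cmova / cmovbe). */
static bool cond_above(flags_model f)          { return !f.cf && !f.zf; }
static bool cond_below_or_equal(flags_model f) { return f.cf || f.zf; }

/* Conditions that need only one of the two (cmovb / cmove). */
static bool cond_below(flags_model f) { return f.cf; }
static bool cond_equal(flags_model f) { return f.zf; }
```

The "above" and "below or equal" conditions are the ones that read from both flag groups, which matches the two instructions singled out above as still being 2 uops.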
See also http://agner.org/optimize/, and https://stackoverflow.com/tags/x86/info for more about x86 microarch details, and optimization guides.

- Agner Fog in his [instruction table](https://www.agner.org/optimize/instruction_tables.pdf) mentions that `CMOVcc r,r` comprises `1 uop` on `Broadwell+`, but 2 uops on `Haswell-`. On my `KbL i7-8550U` however perf counters suggest that it comprises 2 uops. I suppose it was a typo. – St.Antario Apr 05 '20 at 10:41
- @St.Antario: Were you using `cmova` or `cmovbe`? Those are still 2 uops because they read both CF and a flag from the SPAZO cluster (specifically ZF). Other CMOV instructions are single uop. See https://uops.info/ See also [What is a Partial Flag Stall?](https://stackoverflow.com/q/49867597) for SKL details. – Peter Cordes Apr 05 '20 at 11:06