I know they can only correctly execute after instructions before them in Re-Order Buffer are committed. My question is: do modern processors hold them until they are the oldest entry in the ROB, or are prediction counters/structures used to predict flag values such as the Zero flag or Carry flag, with the instruction redone if the prediction turns out to be wrong?

Tiwari

1 Answer

> I know they can only correctly execute after instructions before them in Re-Order Buffer are committed.

No, they only need their own inputs to be ready: those specific previous instructions executed, not retired / committed.

Conditional-move instructions (and ARM predicated execution) treat the flags input as a data dependency, just like add-with-carry, or just like an integer input register. The conditional instruction can't be sent to an execution unit until all 3 of its inputs are ready (see Footnote 1). (Or on ARM, flags + however many inputs the predicated instruction normally has.)
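
To make the "3 inputs" point concrete, here is a minimal C sketch that models `cmovne` and `adc` as pure functions of their inputs. The `flags_t` type and the `*_model` function names are hypothetical, made up for illustration, and the `adc` model ignores the flags that the real instruction also writes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of just the two flags used below. */
typedef struct {
    bool zf; /* Zero flag */
    bool cf; /* Carry flag */
} flags_t;

/* cmovne dst, src: the result depends on the old dst, on src, and on ZF.
   All three must be ready before the result can be computed. */
uint64_t cmovne_model(uint64_t dst, uint64_t src, flags_t f) {
    return !f.zf ? src : dst;
}

/* adc dst, src: dst + src + CF, again three inputs.
   (The flags written by the real instruction are not modeled.) */
uint64_t adc_model(uint64_t dst, uint64_t src, flags_t f) {
    return dst + src + (f.cf ? 1u : 0u);
}
```

Note that the old value of the destination is an input even when the condition is true: the model has no way to produce a result without all three operands.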

Unlike with control dependencies (branches), they don't predict or speculate what the flags will be, so a cmovcc instead of a jcc can create a loop-carried dependency chain and end up being worse than a predictable branch. The Stack Overflow question "gcc optimization flag -O3 makes code slower than -O2" is an example of that.
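
As a rough illustration of the two shapes, here is a C sketch of a running maximum written both ways. The function names are made up, and which form a compiler actually emits (`jcc`, `cmovcc`, or vectorized code) depends on the compiler, its version, and its flags, so check the generated assembly rather than assuming.

```c
#include <stddef.h>
#include <stdint.h>

/* Branch-shaped source: if this compiles to cmp + jcc, the branch is
   predicted, and later iterations can start speculatively before the
   compare against `best` has resolved. */
int32_t max_branchy(const int32_t *a, size_t n) {
    int32_t best = INT32_MIN;
    for (size_t i = 0; i < n; i++) {
        if (a[i] > best)
            best = a[i];
    }
    return best;
}

/* Select-shaped source: if this compiles to cmp + cmovcc, every
   iteration's cmov needs the previous iteration's `best` as a data
   input, so best -> cmp -> cmov -> best forms a loop-carried chain. */
int32_t max_select(const int32_t *a, size_t n) {
    int32_t best = INT32_MIN;
    for (size_t i = 0; i < n; i++) {
        best = (a[i] > best) ? a[i] : best;
    }
    return best;
}
```

Both functions return the same value; the difference is only in how the dependency on `best` is exposed to the out-of-order core if the compiler keeps the shape of the source.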

Linus Torvalds explains in more detail why cmov often sucks: https://yarchive.net/comp/linux/cmov.html

(ARM predicated execution might be handled slightly differently. It has to logically NOP the instruction, even for a load or store to an invalid address. This might be handled with just fault suppression for conditional loads. I don't know if an instruction with a false predicate still costs any latency in the dependency chain for the destination register.)


Footnote 1: This is why cmovcc and adc are 2 uops on Intel before Broadwell: a single uop couldn't have 3 input dependencies. Haswell introduced support for 3-input uops for FMA.

cmov instructions that read CF and one of the SPAZO flags (i.e. cmova and cmovbe which read CF and ZF) are actually still 2 uops on Skylake. See this Q&A for detail: it seems that those two separately-renamed groups of flags are both separate inputs, avoiding flag-merging. See also https://uops.info/ for uop counts.
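
For reference, the x86 condition codes involved can be written as small C predicates (the function names are made up for illustration). The "above" and "below or equal" conditions read both CF and ZF, while conditions such as "equal" or "below" read a single flag.

```c
#include <stdbool.h>

/* cmova / cmovnbe: taken when CF == 0 and ZF == 0 (reads both flags). */
bool cond_above(bool cf, bool zf) { return !cf && !zf; }

/* cmovbe / cmovna: taken when CF == 1 or ZF == 1 (reads both flags). */
bool cond_below_or_equal(bool cf, bool zf) { return cf || zf; }

/* cmove / cmovz: taken when ZF == 1 (reads ZF only). */
bool cond_equal(bool zf) { return zf; }

/* cmovb / cmovc: taken when CF == 1 (reads CF only). */
bool cond_below(bool cf) { return cf; }
```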

See also http://agner.org/optimize/, and https://stackoverflow.com/tags/x86/info for more about x86 microarch details, and optimization guides.

Peter Cordes
  • Agner Fog in his [instruction table](https://www.agner.org/optimize/instruction_tables.pdf) mentions that `CMOVcc r,r` comprises `1 uop` on `Broadwell+`, but 2 uops on `Haswell-`. On my `KbL i7-8550U` however perf counters suggest that it comprises 2 uops. I suppose it was a typo. – St.Antario Apr 05 '20 at 10:41
  • @St.Antario: Were you using `cmova` or `cmovbe`? Those are still 2 uops because they read both CF and a flag from the SPAZO cluster (specifically ZF). Other CMOV instructions are single uop. See https://uops.info/ See also [What is a Partial Flag Stall?](https://stackoverflow.com/q/49867597) for SKL details. – Peter Cordes Apr 05 '20 at 11:06