
I have a test question here.

Which instructions can potentially slow down the processor when the pipeline fails to predict (branch prediction) the further path of execution?

Possible answers: JGE | ADD | SUB | PUSH | JMP | JNZ | MUL | JG | CALL

If we are talking about branch prediction, are JGE, JMP, JNZ & JG the way to go?

hidefromkgb
devKanapka
  • Unconditional JMP cannot be mispredicted, as it jumps unconditionally. – ecm Jan 16 '20 at 12:12
  • @ecm so, because JMP jumps unconditionally, pipeline always predicts outcome? – devKanapka Jan 16 '20 at 12:17
  • Pretty much yeah. – ecm Jan 16 '20 at 16:33
  • @ecm Wait, but what about indirect branches and indirect calls, like `JMP EAX` / `CALL EAX`? When `EAX` gets computed using some over-the-top formula in-situ, indirect jumps and calls are surely going to produce pipeline bubbles. – hidefromkgb Jan 20 '20 at 18:20
  • @ecm and @hidefrom: even direct `jmp rel32` (and `call`) can be mispredicted early in the front-end, needing to re-steer. See [Slow jmp-instruction](//stackoverflow.com/q/38811901). But yes, indirect jmp/call definitely need branch prediction and can get all the way into the back-end before the misprediction is detected, so only the register or memory indirect forms ever need to roll back uops in the back-end. – Peter Cordes Jan 20 '20 at 20:21
  • @hidefromkgb and @Peter Cordes: You're both right of course. I didn't take into account indirect jumps and calls, nor return instructions, which one could say are another type of indirect branch. Interesting link on the slow jmp too. – ecm Jan 20 '20 at 21:36
  • @ecm: yup, `ret` is interesting because it usually matches up with a `call`, so CPUs usually have a special predictor stack (like 16 or 24 entries) dedicated to ret, predicting it much better than treating it as any other indirect branch for the BTB. e.g. for x86 http://blog.stuffedcow.net/2018/04/ras-microbenchmarks/. High-performance ARM microarchitectures do similar things for `bx lr`. But anyway, for me the more surprising fact to learn was that even direct unconditional branches need prediction for the fetch stage, before they're even decoded. `ret` is more clearly an indirect branch. – Peter Cordes Jan 20 '20 at 22:11

1 Answer


Instructions like mul that don't do anything special to EIP of course can't mispredict, but every kind of jump / call / branch can mispredict to some degree in a pipelined design, even a simple call rel32. The effects can be serious in a heavily pipelined out-of-order execution design like modern x86 CPUs.

Yes, jcc conditional branches always need prediction; the value of FLAGS isn't available when decoding, only later when executing.
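
To make "prediction" concrete, here is a minimal sketch of a 2-bit saturating counter, a classic textbook scheme and not what modern x86 CPUs actually implement. The function name, the starting state, and both branch patterns are assumptions chosen for illustration. It counts how often the guess disagrees with the real outcome of one jcc.

```python
def simulate_two_bit_counter(outcomes, state=2):
    """Count mispredictions of a single 2-bit saturating counter.

    outcomes: iterable of bools, True = branch taken.
    state: 0..3; states 0-1 predict not-taken, states 2-3 predict taken.
    This is a textbook model, assumed here for illustration only.
    """
    mispredicts = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            mispredicts += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return mispredicts


# A loop branch: taken 9 times, then falls through; the loop runs 3 times.
loop_pattern = ([True] * 9 + [False]) * 3
# A branch that flips direction on every execution.
alternating = [True, False] * 15

print(simulate_two_bit_counter(loop_pattern))  # 3 (one per loop exit)
print(simulate_two_bit_counter(alternating))   # 15 (half of 30)
```

In this model the loop branch only mispredicts on each loop exit, while the alternating branch is wrong half the time. Real predictors use branch history and would likely learn the alternating pattern.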

Even direct jmp rel8 / jmp rel32 (and call rel32) need prediction early in the front-end, before they're even decoded, so the fetch stage knows which block to fetch next after fetching a block that might or might not include a jump. The jump could be unconditional or a predicted-taken conditional; the fetch stage doesn't need to know which, just whether to keep fetching in a straight line or not. See "Slow jmp-instruction" for more about simple unconditional direct branches running slower if you have too many for the BTB.

If you consider a simple in-order pipeline like a classic 5-stage RISC, with no buffers between stages, all branches are basically equivalent: the fetch stage needs to fetch 1 instruction per clock to avoid bubbles. It needs to know the next fetch address while the previous instruction is still decoding. Longer pipelines make this problem even worse.


But more simply, there are indirect forms of jmp and call like jmp eax or jmp [edi] that load a new EIP from a register or memory. Those obviously need prediction; you have unlimited possibilities for where it will go, not just taken or not-taken.
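
As a rough illustration of why indirect branches are harder, the sketch below models a deliberately naive "same target as last time" predictor for a single jmp eax. The function name and the addresses are hypothetical, and real BTB / indirect predictors are considerably more sophisticated than this.

```python
def simulate_last_target(targets):
    """Count mispredictions when each indirect-branch target is guessed
    to be the same as the previous one (a naive model, for illustration).
    """
    last = None  # nothing known before the first execution
    mispredicts = 0
    for target in targets:
        if target != last:
            mispredicts += 1
        last = target
    return mispredicts


# Hypothetical addresses: the branch always goes to the same place...
same_target = [0x401000] * 20
# ...or cycles through three targets, like a dispatch loop might.
rotating = [0x401000, 0x402000, 0x403000] * 5

print(simulate_last_target(same_target))  # 1 (only the cold first guess)
print(simulate_last_target(rotating))     # 15 (every guess is wrong)
```

With a stable target only the first, cold prediction misses; with a rotating target this simple scheme never guesses right, because there are more than two possible outcomes to choose from.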

Branches that depend on data (conditional on FLAGS, or indirect on register or memory) can get all the way into the back-end (and execute out-of-order) before a mispredict is discovered. Recovering may require discarding results of executing later instructions from the wrong path, as well as fetching/decoding the correct path. See "What exactly happens when a skylake CPU mispredicts a branch?" for details.

But handling mispredicts of direct jmp/call is simpler: just re-steer the fetch/decode stages because the target address is known after decoding the instruction, without having to execute it. The misprediction doesn't make it into the back-end so it's "just" a bubble in the front-end.


Fun fact: ret can also mispredict; it's basically an indirect branch (pop eip). But there are special predictors that take advantage of the usual pairing between call and ret instructions, keeping an internal stack of recent calls that mirrors how the callstack in memory will probably be used. http://blog.stuffedcow.net/2018/04/ras-microbenchmarks/
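
The call/ret pairing idea can be sketched as a small fixed-depth stack. This is a simplified model under stated assumptions: the depth of 16, the drop-oldest overflow behaviour, the function names, and the addresses are all chosen for illustration, and real hardware differs in the details.

```python
def simulate_return_stack(events, depth=16):
    """Count ret mispredictions for a simple return-address predictor stack.

    events: list of ("call", return_address) or ("ret", actual_return_address).
    depth: number of entries; overflow drops the oldest (an assumption).
    """
    stack = []
    mispredicts = 0
    for kind, addr in events:
        if kind == "call":
            if len(stack) == depth:
                stack.pop(0)  # overflow: the oldest prediction is lost
            stack.append(addr)
        else:
            # Predict the most recent unmatched call's return address.
            predicted = stack.pop() if stack else None
            if predicted != addr:
                mispredicts += 1
    return mispredicts


def nested_calls(n):
    """n nested calls followed by n matching returns (hypothetical addresses)."""
    calls = [("call", 0x1000 + i) for i in range(n)]
    rets = [("ret", 0x1000 + i) for i in reversed(range(n))]
    return calls + rets


print(simulate_return_stack(nested_calls(8)))   # 0: fits in the stack
print(simulate_return_stack(nested_calls(20)))  # 4: the 4 oldest entries were lost
```

As long as calls and returns pair up and the nesting fits in the predictor stack, every ret is predicted correctly; nesting deeper than the stack loses the oldest entries, so the outermost returns mispredict in this model.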

Peter Cordes