
I'm having trouble finding information specific to the two cases described below, and thought I'd ask for your expert opinion.

The first thing is: I know indirect jmps hurt branch prediction, and that even when the result of the indirection is constant, the jump still occupies branch-target-prediction resources, compared to a direct jmp.

My question is whether:

mov rax, 1234567812345678h
jmp rax

is still considered indirect by the processor's branch predictor, or whether it recognizes the constant in this case. I'm doing this because x64 doesn't have a direct "jmp absolute 64" instruction, only indirect ones. :/ (How to execute a call instruction with a 64-bit absolute address? suggests this sequence, if you can't instead place the jump close enough to the target and use jmp rel32.)


Secondly, along the same lines, is there any real difference between jmp 0x1234 and call 0x1234 in terms of processor optimization (instruction cache, prefetcher and its hints, branch prediction)? (VC2012's "speed optimization" yields call, "min size" optimization yields jmp, and "mixed optimization" yields jmp for x64 but call for x86.)

Peter Cordes
win32 devPart
    Don't confuse branch-prediction and branch target prediction. Branch prediction is *whether* the branch will jump. Branch target prediction is *where* the branch will jump. In this case, there is no branch prediction - it's an unconditional jump. – Mysticial Oct 30 '12 at 02:34
  • I'd also add that branch target prediction is likely to be extremely good in this case (if the CPU remembers the branch target from last time, it will simply predict that same target). – Brendan Oct 30 '12 at 06:13
  • So if I understood correctly, there isn't much difference (in terms of CPU hardware resources) between RAX being hard-coded to a fixed address and RAX being variable: the CPU will use its target-prediction resources in both cases? (And the only extra cost of the latter case would be the indirection of reading RAX from another variable, for example.) Or is the CPU smart enough to say "it's hard-coded one line earlier, so I don't need to occupy the branch target history buffer"? – win32 devPart Oct 30 '12 at 08:46
  • What I'm concerned about is whether the prefetcher / early pipeline stages will decide that, since the jmp is RAX-based, it is not "absolute" and therefore can't fetch target instructions until the last minute when the jmp executes (and the RAX value is known for sure), instead of concluding that RAX is hard-coded and thus the jmp is effectively absolute. – win32 devPart Oct 30 '12 at 10:23

2 Answers


Intel's branch target (and branch) prediction is both very sophisticated and a closely held trade secret. There isn't necessarily one single algorithm; that is, you can expect the prediction mechanisms to vary across CPUs, depending on how many transistors Intel wants to throw at the problem for a given processor. And, of course, there are other manufacturers of x86 and x64 processors besides Intel.

The historical branch target prediction mechanism, which uses past executions of the same instruction to predict the target of subsequent ones, will almost certainly predict the right target for this branch, because there is only one. So, if this code sequence is re-executed (e.g. in a loop) and it stays in the instruction cache for a while, it will likely be handled very well. (However, on some processors, the branch target prediction mechanism could be neutralized by an effect similar to a cache-line collision, if another branch elsewhere happens to cause a hash collision in the branch target buffer.)

A bigger question is probably how well such a sequence is handled when it occurs in code newly loaded into the cache, which comes down to a processor's non-history-based target prediction capabilities. Such (non-historical) branch target prediction could easily determine the branch target from this code sequence, though whether it does depends entirely on whether the manufacturer deems it worth the real estate on the die for a given processor. Factors in that decision include power consumption, tradeoffs against other performance improvements (i.e. possibly better uses of the same die area), and the expected frequency of this and various other code sequences.

Erik Eidt
    But Agner Fog documents some features of Intel's CPUs; branch prediction is covered on pages 11-34. – osgx Dec 31 '12 at 20:59
  • I haven't heard of any x86-64 CPUs fusing mov r64,imm64 / jmp reg into a single direct-jmp uop, or even doing prediction based on that. ARM CPUs do something like that for Thumb branches, which are technically 2 instructions: one sets some bits of the branch target, the other holds the rest and jumps. But that pair is only ever used together, has no register side effect, and is common. None of those things are true for x86 branches; much more common are memory-indirect branches (every call into a dynamic library). – Peter Cordes Jan 14 '21 at 19:23

"I know indirect jmps hurts branch prediction"

No. Branch prediction and indirect jump prediction are different things. Moreover, indirect jumps are used in table-based switch statements and in interpreters. These are very common use cases and show up in benchmarks, so Intel and others have spent a lot of effort and a lot of transistors improving their performance. One paper (written well after this question!) even goes so far as to say that, starting with Sandy Bridge, you shouldn't trust folklore when it comes to indirect jump prediction. Intel and AMD have an incentive to improve this performance, and they have.

Now, if your jmp example is cold code (this is the first time it is executed), it's impossible to predict, and indeed the Skylake indirect jump predictor will predict the next instruction after the jump and speculate from there. You can shut that speculation down with a UD2, an illegal instruction. In any case, the second time the jmp is executed (if it's still in the BTB), the predicted target will be correct.

As to your second question, the cache effects won't matter. I suppose the smaller version could heroically save a cache line spill, but that's it. The HW prefetcher is for data, not instructions.

Olsonist
  • The paper you linked (https://hal.inria.fr/hal-01100647/document) shows that it's Haswell, not SnB, that really does well at predicting a grand-central-dispatch branch in an interpreter. (believed to be using IT-TAGE). Of course an indirect branch that always goes to the same place is vastly easier to predict, and any form of indirect-branch prediction will succeed (barring destructive aliasing), so even Atom or Pentium 2 would have little problem if the branch runs frequently. – Peter Cordes Jan 14 '21 at 20:40
  • "On the next processor generation Sandy Bridge, the misprediction rate is much lower." The point is that they've been addressing it for several generations. – Olsonist Jan 14 '21 at 20:44
    And BTW, branch prediction in general includes target-prediction for indirect branches. You're kind of implying that they're two different things of similar scope, like branch-direction prediction vs. indirect branch target prediction. There isn't AFAIK a specific single meaning for "branch prediction" that excludes indirect branches. Also note that the front-end needs a prediction on which *block* to fetch next, before the current block is even decoded to see if it contains any branches including relative direct. ([Slow jmp-instruction](https://stackoverflow.com/q/38811901)) – Peter Cordes Jan 14 '21 at 20:46
  • Ok yes, SnB has better branch predictors than NHM. But it's Haswell that makes the biggest change in how the predictors work internally, using IT-TAGE for the first time, as that paper shows with its charts and so on. Especially since you talk about interpreters in that paragraph, it's Haswell that made simple dispatch perform well. – Peter Cordes Jan 14 '21 at 20:52
  • Also, the L2 streamer and adjacent-line prefetchers are in unified L2, and will affect code and data. L1i prefetch is a thing, too, to hide L2 latency for straight-line code. I forget any details, but I'm pretty sure Intel has one. (Speculative demand-load from the front-end, based on branch prediction, might have been what the OP actually meant, separate from dedicated HW prefetchers. Or the real thing that best aligns with the vague way the OP was throwing around terminology they'd heard of.) – Peter Cordes Jan 14 '21 at 20:57
    The CAAQA section on the Core i7 branch predictor describes 2-bit and tournament predictors for conditional branches. But it then describes indirect predictors saying "a separate unit predicts target addresses for indirect branches". I think that means they don't compete for the same BTB slots. As for branch vs jump prediction, CAAQA distinguishes them in its Studies of the Limits of ILP section. (But as an aside, that book is terrible for definitions.) As for prefetchers, I did not know that HW cache line prefetch for instructions was a thing. Is L1i memory prefetch or just cache elevation? – Olsonist Jan 14 '21 at 21:29
  • https://community.intel.com/t5/Software-Tuning-Performance/Instruction-prefetcher-missing-from-Optimization-Manual/td-p/1161502 has an answer from Hadi Brais about the existence of an L1i cache prefetcher. (But the main focus of that forum question is how to do SW prefetch for code, which is a different topic.) – Peter Cordes Jan 15 '21 at 09:01
  • re: terminology: "branch prediction" can be used to specifically mean taken/not-taken for conditional branches, which as you say is separate from branch target prediction. But I think it also gets used as a catch-all term for any kind of prediction of non-linear control-flow, so the OP isn't totally wrong in their use of it that way. – Peter Cordes Jan 15 '21 at 09:04
  • Just because the L2 is unified doesn't *necessarily* mean that the prefetchers there apply equally to data and instructions: the L2 probably still knows the source of the request (indeed, they may be handled differently in some respects: at a minimum they are counted differently by the perf counters), and could track them differently for prefetching. – BeeOnRope Jan 15 '21 at 10:16