
Let's say I have a function that accepts a callback argument (examples given in Rust and C):

```c
void foo(void (*bar)(int)) {
    // lots of computation
    bar(3);
}
```

```rust
fn foo(bar: fn(u32)) {
    // lots of computation
    bar(3)
}
```

Can I rely on the indirect call to `bar` being correctly predicted by the CPU? Assume that at the callsite of `foo`, the value of `bar` is in fact highly dynamic and unpredictable - but because of `// lots of computation`, my expectation is that the CPU has enough "advance warning" such that speculative/out-of-order execution can work across the function-pointer boundary.

Is this the case? If not, is there anything I can do as the programmer to help the CPU out?

ajp
  • You could instead take a generic `bar: impl Fn(u32)`, which will be monomorphised separately for each distinct argument type with which `foo` is called (see the Rust sketch after these comments). Or else examine the assembly, perhaps adding the `#[inline]` attribute if useful. – eggyal Aug 04 '23 at 17:34
  • No. How can a processor know it? Do you want AI built into the MCU? – 0___________ Aug 04 '23 at 17:38
  • Forget about the CPU. Ask how the *compiler* can predict this. – tadman Aug 04 '23 at 17:41
  • You passed the function as an argument to `foo`, so why does the CPU have to predict anything? It's fixed up by the linker and the loader. – Weather Vane Aug 04 '23 at 17:48
  • @WeatherVane The linker? Does the linker run alongside your program? – 0___________ Aug 04 '23 at 17:56
  • @0___________ no, are you unsure of that? The question is about C and Rust. – Weather Vane Aug 04 '23 at 17:58
  • @WeatherVane: Re “why does the CPU have to predict anything? It's fixed up by the linker and the loader”: The linker and loader fix up addresses and record them in the program in a form the CPU can use. But that is not what prediction is about. Prediction, in this case, is about the CPU choosing what bytes to load from memory because program control will soon flow through them. Normally, in a high-performance CPU, the processor starts loading instructions many cycles in advance of decoding, dispatching, and executing them… – Eric Postpischil Aug 04 '23 at 18:03
  • … As part of this, it partially decodes instructions, and, when it sees branch instructions, tries to predict where program control will flow to. Processors do a combination of things to predict: Start by guessing backwards branches (negative displacements) are likely to be taken (loops are repeated) and forward branches are not (if-else-then structures, with compilers aware of how the processor guesses). Cache recent executions. Read hint bits stored in the instructions. With a computed branch, the address taken from a register… – Eric Postpischil Aug 04 '23 at 18:05
  • … the processor might not know what that address will be. Some processors may cache the address for a small number of computed branch instructions, assuming it will be repeated. In this case, OP would like the processor to recognize the branch, fetch the address from the register, and start loading instructions from that address, sufficiently in advance of control reaching the branch that the processor will not start the loads late and have to pause while it waits for memory to respond. – Eric Postpischil Aug 04 '23 at 18:06
  • @EricPostpischil thank you for clarifying the OP's question. I do get what pipelining is about, if not in detail. – Weather Vane Aug 04 '23 at 18:08
  • @0___________: Re “How can a processor know it?”: The processor can examine the branch instruction in advance of program control reaching it (see above), fetch the register it refers to (consulting the rename table and other information to ensure it will not be changed before execution reaches the instruction), and start loading instructions from the address given in the register. – Eric Postpischil Aug 04 '23 at 18:08
  • @ajp: You need to specify the processor model(s). – Eric Postpischil Aug 04 '23 at 18:10
  • You can't guarantee a correct prediction. The best you can do is what you're already doing: let out-of-order exec detect a misprediction early and recover before it costs much wasted work. (So any mis-speculated instructions hopefully haven't been executed yet, and the front-end can re-steer in time to avoid losing cycles.) [Avoid stalling pipeline by calculating conditional early](https://stackoverflow.com/q/49932119) – Peter Cordes Aug 04 '23 at 18:22
  • @EricPostpischil however note that in the given example, there is actually no "branch" - there will be an unconditional jump to a statically-unknown destination (but the destination is "known" many cycles in advance of the jump actually occurring) – ajp Aug 04 '23 at 18:31
  • @EricPostpischil how can `jmp rdi` be predicted? – 0___________ Aug 04 '23 at 18:33
  • @eggyal I specified that "at the callsite of foo the value of bar is in fact highly dynamic and unpredictable", so monomorphisation is not the answer here – ajp Aug 04 '23 at 18:34
  • @0___________ the idea would be that `rdi` is sitting around unchanged for many cycles (variable `bar` is unchanged during function `foo`), so when the out-of-order execution gets to the `jmp`, it can "predict" that `rdi` will have the value that it does in fact already contain. (this is how I'd like things to work, no idea if it actually does, hence the question) – ajp Aug 04 '23 at 18:37
  • @0___________: By reading the contents of the `rdi` register before program control reaches the branch. – Eric Postpischil Aug 04 '23 at 18:41
  • @EricPostpischil `jmp [edi]` then? Will the pipeline read from memory? – 0___________ Aug 04 '23 at 18:46
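
As a rough illustration of eggyal's generic suggestion above (not from the thread; `foo_ptr`, `foo_generic`, and the callback names are invented for the example), here is a minimal Rust sketch. The generic version is monomorphised once per callee type, so each copy can call, or even inline, its callback directly rather than through a pointer:

```rust
// Sketch only: contrast a function-pointer parameter with a generic one.
fn foo_ptr(bar: fn(u32)) {
    // lots of computation ...
    bar(3); // indirect call through a function pointer
}

fn foo_generic(bar: impl Fn(u32)) {
    // lots of computation ...
    bar(3); // callee is statically known per instantiation; may be inlined
}

fn callback_a(x: u32) { println!("a: {x}"); }
fn callback_b(x: u32) { println!("b: {x}"); }

fn main() {
    foo_ptr(callback_a);     // one copy of foo_ptr, dynamic target
    foo_ptr(callback_b);
    foo_generic(callback_a); // separate monomorphised copy per callback type
    foo_generic(callback_b);
}
```

As ajp points out in the comments, this only helps when each callsite's callback can be resolved statically; with a truly dynamic `bar`, the indirect call remains.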

1 Answer


You can't guarantee a correct prediction.

The best you can do is what you're already doing: have the branch condition (or target) ready early, to let out-of-order exec detect a misprediction early and recover before it costs much wasted work. (So any mis-speculated instructions hopefully haven't been executed yet, and the front-end can re-steer in time to avoid losing cycles.)

See [Avoid stalling pipeline by calculating conditional early](https://stackoverflow.com/q/49932119).
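
A minimal Rust sketch of that pattern, not taken from the answer (the `data` slice and the checksum loop are stand-ins for the question's `// lots of computation`):

```rust
// Sketch: the function-pointer target is in a register long before the call,
// and the intervening work does not depend on it, so out-of-order exec can
// verify (or correct) the predicted target early.
fn foo(bar: fn(u32), data: &[u32]) {
    // `bar` is already known here, many cycles before the indirect call.

    // Independent "lots of computation":
    let sum = data.iter().fold(0u32, |acc, &x| acc.wrapping_add(x));

    // Indirect call: if the predictor guessed the wrong target, the mispredict
    // is detected as soon as the call can issue, limiting the wasted work.
    bar(sum);
}
```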

A mispredict will unavoidably cost cycles for the front-end, but a lot of code runs at fewer than 4 instructions (or 4 uops) per cycle, so the front-end is able to get ahead of where the oldest uops are still executing, especially if there are bottlenecks like a long dependency chain without much instruction-level parallelism.


Modern branch predictors are quite good even with complex patterns, e.g. IT-TAGE uses the history of the past few branches to index a predictor for this one. This leads to decent performance even for the dispatch `switch` in an interpreter loop or the like, unlike on older CPUs where a complex pattern for a single indirect branch was a big problem. (See How does Branch Prediction affect performance in R? for some links, especially Branch Prediction and the Performance of Interpreters - Don't Trust Folklore (2015) by Rohou, Swamy, and Seznec.)
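
For context, here is a minimal Rust sketch of the kind of dispatch loop that paper measures (the `Op` enum and `run` function are invented for illustration). Each iteration funnels through one indirect branch whose target depends on the bytecode stream, which is exactly where a history-based predictor like IT-TAGE helps:

```rust
#[derive(Clone, Copy)]
enum Op {
    Add(u32),
    Sub(u32),
    Halt,
}

// Every iteration dispatches through the same `match` (typically compiled to
// a jump table, i.e. one indirect branch), so the branch's target pattern is
// as complex as the bytecode being interpreted.
fn run(code: &[Op]) -> u32 {
    let mut acc = 0u32;
    for &op in code {
        match op {
            Op::Add(n) => acc = acc.wrapping_add(n),
            Op::Sub(n) => acc = acc.wrapping_sub(n),
            Op::Halt => break,
        }
    }
    acc
}
```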

Peter Cordes