30

Modern CPUs have extensive pipelining; that is, they load necessary instructions and data long before they actually execute an instruction.

Sometimes, the data loaded into the pipeline gets invalidated, and the pipeline must be cleared and reloaded with new data. The time it takes to refill the pipeline can be considerable and cause a performance slowdown.

If I call a function pointer in C, is the pipeline smart enough to realize that the pointer in the pipeline is a function pointer, and that it should follow that pointer for the next instructions? Or will having a function pointer cause the pipeline to clear and reduce performance?

I'm working in C, but I imagine this is even more important in C++ where many function calls are through v-tables.


edit: @JensGustedt writes:

To be a real performance hit for function calls, the function that you call must be extremely brief. If you observe this by measuring your code, you definitely should revisit your design to allow that call to be inlined.

Unfortunately, that may be the trap that I fell into.

I wrote the target function small and fast for performance reasons.

But it is referenced by a function pointer so that it can easily be replaced with other functions (just make the pointer reference a different function!). Because I refer to it via a function pointer, I don't think it can be inlined.

So, I have an extremely brief, not-inlined function.

abelenky
  • I guess it depends on the platform to a certain extent; can we assume you're talking about x86? – Oliver Charlesworth May 25 '12 at 15:34
  • Yes, x86. (64-bit, Core2, to be more precise) – abelenky May 25 '12 at 15:35
  • Is a function pointer any different to a regular function call? – Martin Beckett May 25 '12 at 15:39
  • @MartinBeckett: Well it involves an extra level of indirection... – Oliver Charlesworth May 25 '12 at 15:43
  • @OliCharlesworth - and it can't be predicted at compile time I suppose – Martin Beckett May 25 '12 at 15:59
  • To be a real performance hit for function calls, the function that you call must be extremely brief. If you observe this by measuring your code, you definitely should revisit your design to allow that call to be inlined. – Jens Gustedt May 25 '12 at 16:25
  • @MartinBeckett A regular function call has its address encoded in the instruction stream, and so the branch predictor can work out the target at the same time the instruction gets decoded -- long before the branch opcode actually hits the ALU. A function pointer means a calculated target in some register, so the new IP can't be calculated until the branch opcode actually gets to the execute stage, which is many cycles later in the pipeline. Basically it's almost always possible to properly predict a static branch, and almost never possible to predict a calculated (not conditional) branch. – Crashworks Jun 05 '12 at 03:05
  • @Crashworks - yes just spotted the typo. I meant to say you CAN predict function calls at compile time (unlike function ptrs) – Martin Beckett Jun 05 '12 at 12:55

4 Answers

15

On some processors an indirect branch will always clear at least part of the pipeline, because it will always mispredict. This is especially the case for in-order processors.

For example, I ran some timings on the processor we develop for, comparing the overhead of an inline function call, versus a direct function call, versus an indirect function call (virtual function or function pointer; they're identical on this platform).

I wrote a tiny function body and measured it in a tight loop of millions of calls, to determine the cost of just the call penalty. The "inline" function was a control group measuring just the cost of the function body (basically a single load op). The direct function measured the penalty of a correctly predicted branch (because it's a static target and the PPC's predictor can always get that right) and the function prologue. The indirect function measured the penalty of a bctrl indirect branch.

614,400,000 function calls:

inline:   411.924 ms  (   2 cycles/call )
direct:  3406.297 ms  ( ~17 cycles/call )
virtual: 8080.708 ms  ( ~39 cycles/call )

As you can see, the direct call costs 15 cycles more than the function body, and the virtual call (exactly equivalent to a function pointer call) costs 22 cycles more than the direct call. That happens to be approximately how many pipeline stages there are between the start of the pipeline (instruction fetch) and the end of the branch ALU. Therefore on this architecture, an indirect branch (aka a virtual call) causes a clear of 22 pipeline stages 100% of the time.

Other architectures may vary. You should make these determinations from direct empirical measurements, or from the CPU's pipeline specifications, rather than assumptions about what processors "should" predict, because implementations are so different. In this case the pipeline clear occurs because there is no way for the branch predictor to know where the bctrl will go until it has retired. At best it could guess that it's to the same target as the last bctrl, and this particular CPU doesn't even try that guess.

Crashworks
  • This is a very interesting post, but it should be emphasized that it depends on the architecture. From your post (the link is outdated, btw), it seems you were testing IBM Xenon processors. I would also presume the same might apply for lower-end embedded microcontrollers, but Intel and AMD processors have had some sort of prediction for indirect jumps since ~2004 or something. Even today's ARM processors shouldn't clear the entire pipeline without a misprediction. – vgru Jun 06 '16 at 10:44
10

Calling a function pointer is not fundamentally different from calling a virtual method in C++, nor, for that matter, is it fundamentally different from a return. The processor, in looking ahead, will recognize that a branch via pointer is coming up and will decide if it can, in the prefetch pipeline, safely and effectively resolve the pointer and follow that path. This is obviously more difficult and expensive than following a regular relative branch, but, since indirect branches are so common in modern programs, it's something that most processors will attempt.

As Oli said, "clearing" the pipeline would only be necessary if there was a mis-prediction on a conditional branch, which has nothing to do with whether the branch is by offset or by variable address. However, processors may have policies that predict differently depending on the type of branch address -- in general a processor would be less likely to aggressively follow an indirect path off of a conditional branch because of the possibility of a bad address.

Hot Licks
  • For elaboration, say a processor is trying to fetch four or more instructions every cycle (which they do). Branches pose a problem as the instruction pointer has now moved oddly, so the processor has to predict where it will go. It uses two structures: the branch predictor and the branch target buffer (BTB). For indirect branches, the real cost is in the BTB, as it has often just stored the last target for the branch. So the next time the processor encounters the branch instruction (in this case, call of the function pointer), the processor assumes the same call as last time is being made. – Brian May 25 '12 at 17:50
  • Yeah, I forgot about the branch target buffer stuff (it's been awhile since I mucked with this stuff). It's a sort of a cache (the exact nature of which I never dug into), so the processor can look there rather than looking at the actual branch address (which may not be available yet). This is, in fact, a hair like conditional branch prediction in that the assumption that the same target will be used again can be wrong, and so it can indeed result in the need to "clear" the pipeline to the extent that code has been pre-executed down the predicted path. – Hot Licks May 25 '12 at 21:08
  • Branch predictors deal with static branches (the offset is in the instruction stream) very differently from indirect branches (the target is in a register). If the branch is static -- like an ordinary function call, or an unconditional GOTO (they're the same thing in microcode), then the branch predictor can get it right 100% of the time by the instruction decode stage of the pipeline. If the destination address is in a register, the branch unit has no way to know where the jump will go until the target has been calculated. At best the predictor can guess it'll go the same place as last time. – Crashworks Jun 05 '12 at 00:11
  • CPUs always take their prediction result and follow it fully aggressively. AFAIK, no CPUs do any "weaker" stuff when they're not sure of predictions. See [Why not just predict both branches?](//stackoverflow.com/a/49637475), and the discussion in comments (moved to chat) about using prediction confidence to control prefetch (for CPUs that don't actually *execute* speculatively). – Peter Cordes Apr 29 '18 at 12:22
  • @PeterCordes - I'm pretty sure that few general-purpose CPUs will perform a predicted store unless they are sure of the prediction. – Hot Licks May 27 '18 at 12:54
  • @HotLicks: the store buffer enabled speculative stores to execute, delaying global visibility (commit to L1d) until after the store retires (i.e. becomes non-speculative). This, and hiding latency for cache-miss stores, are the main purposes of a store buffer. See [Out-of-order execution vs. speculative execution](https://stackoverflow.com/a/49661172), and for more basic stuff see [this answer](https://stackoverflow.com/a/25694683/224132), and [How does memory reordering help processors and compilers?](https://stackoverflow.com/q/37725497), – Peter Cordes May 27 '18 at 20:11
  • @HotLicks: This answer I wrote a while ago is probably the best explanation of how speculative stores are handled: [Out-of-order instruction execution: is commit order preserved?](https://stackoverflow.com/q/39670026), with loads having to probe the store buffer to maintain the illusion of executing in program order. – Peter Cordes May 27 '18 at 20:25
  • CPUs generally always follow their predictions aggressively, but it doesn't mean that there can't be different kinds of mispredicts: modern chips might have a variety of "severities" of mis-predictions. For example, on chips with complex, slow predictors, the front-end might first do a fast prediction based on the current fetch address, with a slower but better prediction arriving a cycle or two later. This type of re-steer can potentially be handled completely in the front-end, causing a few cycles of bubble in fetch/decode, but not involving the ROB or any complex OoO state rollback. – BeeOnRope May 27 '18 at 21:38
  • Then you might detect a misprediction at the decode stage: after you decode an instruction and realize (for example) that it's an unconditional jump but that you didn't predict it to be a jump at all (happens 100% of the time the first time a jump is seen). Again, the instructions haven't "escaped" the front-end, so you can probably handle it in the front-end, with a few lost cycles (5 or 6 on Intel) in the FE, but no rollback of ROB entries or executed instructions. Then, finally you have "full" mispredictions discovered at execution, and these are, AFAIK, usually mostly equivalent in cost. – BeeOnRope May 27 '18 at 21:41
4

A call through a function pointer doesn't necessarily cause a pipeline clear, but it may, depending on the scenario. The key is whether the CPU can effectively predict the destination of the branch ahead of time.

The way that modern "big" out-of-order cores handle indirect calls1 is roughly as follows:

  • Once you've executed the indirect branch a few times, the indirect branch predictor will try to predict the address to which the branch will occur in the future.
  • The first indirect branch predictors were very simple, capable of "predicting" only a single, fixed location.
  • Later predictors, including those on most modern CPUs, are much more complex, often capable of predicting well a repeated pattern of indirect jumps and also correlating the jump target with the direction of earlier conditional or indirect branches.
  • If the prediction is successful, the indirect call has a cost similar to a normal direct call, and this cost is largely "out of line" with the rest of the code (i.e., doesn't participate in dependency chains) so the impact on final runtime of the code is likely to be small unless the calls are very dense.
  • On the other hand, if the prediction is unsuccessful, you get a full misprediction, similar to a branch direction misprediction. You can't put a fixed number on the cost of this misprediction, since it depends on the surrounding code, but it usually causes a bubble of about 20 cycles in the front-end, and the overall cost in runtime often ends up similar.

So given those basics we can make some educated guesses at what happens in some specific scenarios:

  1. A function pointer that always points to the same function will almost always1 be well predicted and cost about the same as a regular function call.
  2. A function pointer that alternates randomly between several targets will almost always be mispredicted. At best, we can hope the predictor always predicts whatever target is most common, so in the worst case, where the targets are chosen uniformly at random among N targets, the prediction success rate is bounded by 1 / N (i.e., it goes to zero as N goes to infinity). In this respect, indirect branches have a worse worst-case behavior than conditional branches, which generally have a worst-case misprediction rate of 50%2.
  3. The prediction rate for a function pointer with behavior somewhere in the middle, e.g., somewhat predictable (e.g., following a repeating pattern), will depend heavily on the details of the hardware and the sophistication of the predictor. Modern Intel chips have quite good indirect predictors, but details haven't been publicly released. Conventional wisdom holds that they use some indirect variant of the TAGE predictors also used for conditional branches.

1 Cases that would mispredict even for a single target include the first time (or first few times) the function is encountered, since the predictor can't predict indirect calls it hasn't seen yet! Also, the size of prediction resources in the CPU is limited, so if the function pointer hasn't been used in a while, those resources will eventually be used for other branches and you'll suffer a misprediction the next time you call it.

2 Indeed, a very simple conditional predictor that simply predicts the direction most often seen recently should have a 50% prediction rate on totally random branch directions. To get a result significantly worse than 50%, you'd have to design an adversarial algorithm that essentially models the predictor and always chooses to branch in the direction opposite the model's prediction.

BeeOnRope
1

There's not a great deal of difference between a function-pointer call and a "normal" call, other than an extra level of indirection. So potentially there's a greater latency involved; if the destination address is not already in cache or registers, then the CPU potentially has to wait while it's retrieved from main memory.

So the answer is: yes, the pipeline can stall, but this is no different to normal function calls. And as usual, mechanisms such as branch prediction and out-of-order execution can help minimise the penalty.

Oliver Charlesworth
  • Sure, I understand that the pipeline may stall if the relevant instructions aren't readily available (in the L1 or L2 cache). But a stall is different from a clear. In a clear, the pipeline has already started preparing to execute one instruction, then suddenly realizes it'll be doing different work entirely. It needs to throw out the partial work already done and start re-filling the pipeline with a new task. – abelenky May 25 '12 at 15:49
  • @abelenky: AFAIK, the pipeline only needs to be cleared in situations like branch mis-prediction. – Oliver Charlesworth May 25 '12 at 15:57
  • No, the pipeline needs to be cleared on any kind of mispredict, including branch target misprediction. If it gets the target wrong, it will have loaded and likely started speculatively executing instructions from the wrong location, and those must be flushed. – BeeOnRope Jan 23 '13 at 05:00