7

Take this stupid example:

class Base
{
  public:
    virtual void ant() { i++; };
    virtual void dec() { i--; };
  int i;
};

void function(Base * base) 
{ 
  base->ant(); 
  base->dec(); 
}

The way I imagined this to be implemented by the compiler was two virtual function calls. Clang does just that (using a tail call for the call to dec()):-

function(Base*):                      # @function(Base*)
        push    rbx
        mov     rbx, rdi
        mov     rax, qword ptr [rbx]
        call    qword ptr [rax]
        mov     rax, qword ptr [rbx]
        mov     rdi, rbx
        pop     rbx
        jmp     qword ptr [rax + 8]     # TAILCALL

GCC, on the other hand, goes a long way out of its way to partially inline the function calls:

Base::ant():
        add     DWORD PTR [rdi+8], 1      # this_2(D)->i,
        ret
Base::dec():
        sub     DWORD PTR [rdi+8], 1      # this_2(D)->i,
        ret
function(Base*):
        push    rbx     #
        mov     rax, QWORD PTR [rdi]      # _3, base_2(D)->_vptr.Base
        mov     rbx, rdi  # base, base
        mov     rdx, QWORD PTR [rax]      # _4, *_3
        cmp     rdx, OFFSET FLAT:Base::ant()      # _4,
        jne     .L4       #,
        mov     rax, QWORD PTR [rax+8]    # _7, MEM[(int (*__vtbl_ptr_type) () *)prephitmp_8 + 8B]
        add     DWORD PTR [rdi+8], 1      # base_2(D)->i,
        cmp     rax, OFFSET FLAT:Base::dec()      # _7,
        jne     .L6       #,
.L12:
        sub     DWORD PTR [rbx+8], 1      # base_2(D)->i,
        pop     rbx       #
        ret
.L4:
        call    rdx     # _4
        mov     rax, QWORD PTR [rbx]      # _3, base_2(D)->_vptr.Base
        mov     rax, QWORD PTR [rax+8]    # _7, MEM[(int (*__vtbl_ptr_type) () *)prephitmp_8 + 8B]
        cmp     rax, OFFSET FLAT:Base::dec()      # _7,
        je      .L12        #,
.L6:
        mov     rdi, rbx  #, base
        pop     rbx       #
        jmp     rax       # _7

Is this really advantageous, and why? It looks like the same number of memory lookups but with an additional comparison thrown in the mix. Perhaps my example is too trivial.

See godbolt for a version you can play with.

JCx
  • 2,689
  • 22
  • 32
  • 2
    [Ant and Dec](https://en.wikipedia.org/wiki/Ant_%26_Dec) are fun names for an example, but couldn't you have made `dec()` be the one that decrements, and can compile to a `dec` instruction! Here's a [link to that code on Godbolt](https://godbolt.org/g/UIq2Gc), with the ++ and -- swapped. (You should edit that into your question, preferably with the full link, not the shortened link so link-rot can't kill it) – Peter Cordes Jun 18 '16 at 00:13
  • Holy cow! I thought I'd made dec decrement, but I hadn't! Fixed :) – JCx Jun 19 '16 at 20:59
  • @PeterCordes - just curious, what's wrong with the short links? – BeeOnRope Jun 21 '16 at 01:55
  • 1
    @BeeOnRope: goo.gl's database can forget them. This has actually happened to short links in some of my old answers. See also http://meta.stackoverflow.com/a/319594/224132. – Peter Cordes Jun 21 '16 at 02:02
  • Hmmm, but godbolt short links use a godbolt domain, like `https://godbolt.org/g/7GFfau`. So at least you don't introduce a new dependency on a third party shortener (of course it's possible that Matt internally uses a third party). Honestly it's shocking to me that goo.gl would forget links. That's a microscopic amount of information to archive and one thing Google is good at is archiving information cheaply. – BeeOnRope Jun 21 '16 at 02:06

2 Answers

7

As you already figured out, gcc is trying to optimistically inline the functions. That is, it inlines their bodies after checking that the functions are not in fact overridden: it does this by comparing the value of the function pointers in their vtable1. This type of optimization may be referred to as speculative devirtualization and is widely applied in JIT-compiled languages such as C# and Java, while being more difficult, less profitable and less often applied in ahead-of-time compiled languages such as C++. Here we use speculative to distinguish it from the variant where the compiler can prove that the devirtualization is valid. In the optimistic case we instead hope that it is valid, and double-check at runtime.
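
To make the transformation concrete, here is a rough hand-written rendering of what gcc's fast path amounts to at the source level. This is only a sketch: the compiler's actual check compares the function-pointer slots in the vtable (see footnote 1), which portable C++ can't express, so a dynamic-type check stands in for it here, and the function name is made up.

#include <typeinfo>

// Hand-rolled analogue of the speculative devirtualization (sketch only).
// If the dynamic type is exactly Base, the virtual calls must resolve to
// Base::ant and Base::dec, so call them non-virtually and let them inline.
void function_speculative(Base * base)
{
    if (typeid(*base) == typeid(Base)) {
        base->Base::ant();   // fast path: non-virtual, inlinable
        base->Base::dec();
    } else {
        base->ant();         // slow path: ordinary virtual dispatch
        base->dec();
    }
}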

This whole topic could fill a book (well, at least a paper or two), so if you want all the gory details on how the various types of devirtualization are implemented, take a look at Honza's 7-part series on devirtualization. He helped implement it in gcc4, and a lot of the content there applies directly to this question. Here I'll focus on some specifics of your example.

Overall, it's a pretty interesting optimization that could pay big dividends in certain cases. In your specific case, however, it seems a bit questionable. First, note that it falls into a class of probabilistic optimizations: optimizations whose benefit or penalty depends on the runtime behavior of the application - in particular, on whether the object passed as the base argument actually overrides the ant() and dec() methods. Whether the optimization is profitable will swing from "strictly a bad idea" to "looks somewhat OK" depending on the actual frequencies.

For example, if the actual behavior of the application is to always pass a base with the default ant() and dec() methods shown, the benefits are:

  1. A somewhat lower-latency code path. Ignoring common work (such as loading the function pointers from the vtable, and the actual increment), the inlined approach basically adds two cmp + jne pairs, but saves one indirect call, one indirect jmp and two rets. Just counting instructions, that is likely to be a wash, but in practice the two macro-fused cmp/jne pairs are going to be very cheap, and also "out of line" with respect to the increment operations, so their latency cost is probably completely hidden. The indirect jumps and ret instructions are less likely to be hidden: modern x86 like Haswell is still good at executing these quickly, but there are hard limits like one taken branch per cycle, etc.

    That said, in this case, the difference between the two paths is likely to be fairly small. The two RMW inc and dec operations are likely to take longer than the jumping around.

  2. The branch prediction behavior is likely to be better, most notably when this code is cold. When cold, there is no info in any of the predictors about the likelihood of branches being taken, or about their targets (in the case of indirect branches). The two jne branches in the inlined case will likely default to predicted not-taken, and so be predicted correctly2, and the indirect branches are avoided entirely. The always-virtual-call approach, however, has no chance of correctly predicting the branch targets, so you are going to suffer two consecutive branch mispredictions.

On the other hand, when the ant() and dec() methods are always overridden, the optimization is a dead loss. You added the extra comparisons and taken jumps to the execution path, and then you had to do the indirect call anyway. Furthermore, you bloated the code size by a huge amount (47 bytes for the gcc version vs. 13 bytes for the clang one), relatively speaking. If your runtime code footprint is large, this will really hurt you in terms of I$ misses.

Peter mentions that even when the checks fail, at least the optimization reduces the number of targets (by 1) for the subsequent indirect branch. However, you still need to include the mispredicts for the cmp/jne pair in the analysis. When you do that, it seems like the check-first approach is always either tied or worse - at least in terms of expected mispredicts.

A couple of quick examples:

Two targets (x, y) for ant() with probabilities p(x) and p(y).

Assume p(x) > p(y) without loss of generality3.

Check first:

jne predicts x always, so expected jne misses: p(y)
indirect call predicts y always, with zero expected misses
expected misses = p(y)

Virtual call only:

BTB predicts x
expected misses: p(y)

So both cases are exactly the same, with p(y) expected misses.

Three targets, with probabilities x, y, z (we skip the p(x) notation).

Assume for simplicity that y == z. You can work through the case where y != z. It doesn't change the qualitative conclusions.

Case x > y + z

Check first:

jne predicts not-taken (x) always, so expected jne misses: y + z = 2y
indirect call predicts y, with z (== y) expected misses
misses = x*(0 + 0) + y*(1 + 0) + z*(1 + 1)
       = y + 2z = 3y

Virtual call only:

BTB predicts x
misses = x*0 + y + z = 2y

So in this case, check-first is dominated by virtual-only in mispredict chance, suffering 50% more mispredicts (proportional to the probability of the two less common targets). At the boundaries, when p(y) == p(z) == 0, it ties (but that means there aren't really three targets), and when p(x) == p(y) + p(z) it suffers an expected 0.75 mispredicts per call compared to 0.5 for the virtual-call-only approach.

Let's check the x < y + z case.

Check first:

jne predicts taken (y or z) always, so expected jne misses: x
indirect call predicts y always, with z expected misses
misses = x*(1 + 0) + y*(0 + 0) + z*(0 + 1)
       = x + z

Virtual call only:

BTB predicts x
expected misses: y + z

So here again the virtual-call-only approach dominates the check-first approach. The latter suffers p(x) - p(y) additional misses. The worst case seems to be when p(x) == ~0.49... and p(y) == p(z) == ~0.25, where the check-first approach suffers ~0.25 additional misses per call. The other boundary condition is again a tie when p(z) == 0 (expected, since that's the two-target case).
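
If you want to sanity-check the arithmetic, here is a small self-contained program (my own sketch, not anything gcc computes) that evaluates both strategies under the same assumptions as above: the jne settles on whichever direction is more common, and each predictor settles on the most likely target it can see.

#include <algorithm>
#include <cstdio>

// Expected mispredicts per call in the three-target model (x + y + z == 1,
// x assumed to be the most likely target and the one the cmp/jne tests for).
static double check_first(double x, double y, double z) {
    double jne_miss  = std::min(x, y + z);  // jne predicts the more common direction
    double call_miss = std::min(y, z);      // indirect call predicts the likelier of y, z
    return jne_miss + call_miss;
}

static double virtual_only(double y, double z) {
    return y + z;                           // BTB predicts x; misses whenever y or z occurs
}

int main() {
    const double cases[][3] = {{0.6, 0.2, 0.2}, {0.5, 0.25, 0.25}, {0.4, 0.3, 0.3}};
    for (const auto &c : cases)
        std::printf("x=%.2f y=%.2f z=%.2f  check-first=%.3f  virtual-only=%.3f\n",
                    c[0], c[1], c[2], check_first(c[0], c[1], c[2]), virtual_only(c[1], c[2]));
}

For the {0.5, 0.25, 0.25} case this reproduces the 0.75 vs. 0.5 figures above.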

Now all of the above assumes that taken vs. not-taken branch mispredicts are equal in cost to branch-target mispredicts. On modern x86 CPUs the cost does in fact seem to be about the same: each type of mispredict causes a full redirect and a penalty of 15-20 cycles. There are still second-order effects present - for example, an indirect branch may be resolved later than a conditional direct one if the jump address takes longer to calculate than the branch condition. That doesn't seem to be the case here, since both the branch decision and the branch address depend on the same thing: the address of the ant() function in the vtable.

The above also assumes that indirect branches are as well predicted as conditional branches, and that this prediction consumes an equal amount of resources. This is, in general, not true. Processors generally have fewer resources dedicated to indirect BTB entries, and even given equal resources, getting to a given prediction rate requires more resources in the IBTB scenario versus the conditional branch scenario, since IBTB state (a target address) is larger than a single bit of taken/not-taken info. Smaller or older CPUs may not have any indirect branch prediction capability at all, and this difference is real even in modern x86 CPUs. It is hard to quantify, but at the limit, when this function is called a lot, the difference disappears (since there are enough resources to accurately track the IBTB targets for at least the hottest calls).

If you got this far, you might well feel that, overall, this seems like a questionable optimization in this specific case. The potential upside is fairly small, and the cost in terms of code bloat is large. Furthermore, in the intermediate cases (where base->ant() is sometimes equal to Base::ant), the trade-off is questionable since the increased mispredicts eat into the benefit of inlining the call.

On the face of it, I would tend to agree - but there are a couple of mitigating factors:

First, gcc is actually trying to be smart about when it applies this optimization. It applies it only when it can't see that the methods in question are actually overridden. Take a look at this small modification of your example: the function is unchanged, but we add a subclass of Base (unused in this file) which does override the methods. Now gcc no longer does the speculative inlining. It sees that the methods are overridden, so it doesn't consider the inlining worth it.
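
Something along these lines is enough (the subclass name and method bodies here are made up for illustration; all that matters is that an override is visible in the same translation unit):

class Derived : public Base
{
  public:
    virtual void ant() { i += 2; };   // overrides Base::ant
    virtual void dec() { i -= 2; };   // overrides Base::dec
};

// function(Base*) is unchanged, and Derived is never even instantiated here,
// but its mere presence makes gcc drop the speculative inlining.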

This whole idea of what gcc can see is pretty important. Usually gcc is looking only at a single compilation unit. That greatly restricts its visibility. You can make a reasonable argument that, in general, Base would be overridden in a separate compilation unit, if at all, so the behavior noted above (gcc deciding whether to apply the speculative inlining based on whether an override exists) isn't too useful, since same-file overrides are rare.

On the other hand, you might note that godbolt is putting your class definition in a .cpp file5. It is very rare and bad practice for someone to include a .cpp file in another compilation unit. So the guess that Base isn't overridden is perhaps a pretty good one in this case. Of course, I don't think gcc actually uses that info - by the time the actual compiler sees the file, the distinction between header and .cpp files has pretty much been lost.

Whether a class is likely to be overridden, when your scope is a single compilation unit, is nearly as much a philosophical question as a technical one. This is exactly the kind of question that LTO and PGO (as Peter mentioned) are supposed to solve. In the LTO case, by deferring optimization to link-time, the entire set of statically available6 classes and method overrides is available for examination. In the case of PGO, the compiler can use information about which classes actually appear as targets at every call site, and optimize based on the observed overrides. It can even make different decisions for different call sites (assuming function() itself can be inlined). PGO can approach the quality of devirtualization that is usually reserved for JIT-compiled languages.

In fact, this topic is important enough that Jan gives it an entire entry in his series about devirtualization. Of particular importance, he covers cases where the compiler can be sure that it knows there are no subclasses/method overrides, and hence devirtualization is no longer speculative: classes in anonymous namespaces, local classes, and final methods and classes.
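
As a quick illustration of the last of those (a sketch with made-up names, not taken from the series itself): marking the class final guarantees that no override can exist, so the compiler can devirtualize the calls outright, with no runtime check and no fallback path.

class SealedBase final
{
  public:
    virtual void ant() { i++; };
    virtual void dec() { i--; };
  int i;
};

void function_sealed(SealedBase * base)
{
  base->ant();   // the dynamic type must be SealedBase, so these become
  base->dec();   // direct (and inlinable) calls - no vtable lookup needed
}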

A final note in defense of gcc's decision to speculatively inline this: the example given is more or less at one end of the risk-vs-reward spectrum for this optimization. As it turns out, the functions involved take only a single instruction to implement (aside from the required ret, one of which is even optimized away by the tail call). The benefit of inlining here is therefore very small: inlining mostly pays off as a "granddaddy" optimization that enables lots of other optimizations to work, and often eliminates a large part of the cost of the inlined function. Because the functions are so small, with zero prologue and epilogue, and because gcc can't easily optimize between ant() and dec() (they end up in different basic blocks), inlining doesn't help very much.

Here's another example which gives gcc more of a chance to optimize:

class Base
{
  public:
    virtual int add5(int x) { return 5 + x; };
};

int function(Base * base, int x) 
{ 
  return base->add5(x) - base->add5(x);
}

int add5_static(int x) {
  return 5 + x;
}

int function2(int x) {
  return add5_static(x) - add5_static(x);
}

int function3(int x) {
  Base b = Base();
  return b.add5(x) - b.add5(x);
}

Here, you call the same function more than once. This allows gcc to share the function pointer check (you only need one check for add5, not two as for ant and dec), and it could allow gcc to optimize between the two calls, replacing `(5 + x) - (5 + x)` with something as simple as `0`.

Let's see what gcc does with that, via godbolt! Looks good initially. The inlined version of the call only requires one cmp/jne since the same function is called twice. Here's the optimized version, starting with the function pointer check:

    cmp     rax, OFFSET FLAT:Base::add5(int)  # _4,
    jne     .L3       #,
    add     esi, 5    # _11,
    mov     ebp, esi  # _7, _11
    mov     eax, ebp  # _7, _7
    pop     rbx       #
    sub     eax, esi  # _7, _11
    pop     rbp       #
    pop     r12       #
    ret

It's a mixed bag. There are 7 instructions after the jump, and a lot of redundancy. First, note that gcc is able to do some inter-call optimization. In particular, the single add esi, 5 shows that the common sub-expression 5 + x shared by the two (identical) calls was computed only once. You then get eax = ebp = esi. The assignment to ebp is redundant - it is never used again in this function. Then you get sub eax, esi. This is totally redundant because eax == esi, so the result is always zero.

Even the pop instructions are all redundant. We simply push those three registers at the top of the function, then pop them off again in the optimized path. It is legitimate to push these registers before calling the virtual methods, but that can all be deferred until after the check. So a total of six push and pop instructions are unnecessary in the optimized path.

All that is to say that the optimal implementation of these 7 instructions is simply xor eax, eax, a near-free instruction (and the three earlier push instructions, not shown, could also be avoided7). Gcc missed all the easy optimizations, and the optimized path is perhaps an order of magnitude slower than it needs to be. The reality is that even though these optimizations are "obvious", everything occurs in stages: the stages that could remove all the redundant moves, the subtraction, and the pushes and pops probably run before the devirtualization phase, and later on it is too late for this redundancy to be removed.

Just to show that gcc is indeed capable of optimizing this in a better way, take a look at function2:

int add5_static(int x) {
  return 5 + x;
}

int function2(int x) {
  return add5_static(x) - add5_static(x);
}

This is the same as function, except without the complication of possibly virtual function calls. The entire function is simply:

function2(int):
    xor     eax, eax  #
    ret 

So everything cancels out, into return 0;. The compiler could do the same thing in the speculatively devirtualized version of add5, but it fails to.

Oddly enough, the non-speculative devirtualization does just fine:

int function3(int x) {
  Base b = Base();
  return b.add5(x) - b.add5(x);
}

reduces to exactly the same thing:

function3(int):
    xor     eax, eax  #
    ret

So something about speculative devirtualization seems to make the result less susceptible to many of the optimizations that would otherwise greatly simplify the code and produce some big wins. Even without those, though, the devirtualized version is likely to be noticeably faster than the version without devirtualization.


1It is worth noting that this is a more precise check than what typically occurs when devirtualizing JIT-compiled languages. In Java, for example, the check is simply against the object type (i.e., the value of the vtable pointer), rather than against the particular function implementation (i.e., against the value of the function pointer in the vtable). The function-specific check is more costly, since it involves dereferencing the vtable pointer, but it "works" in many more cases: it would work for any subclass of Base that didn't actually override ant() and/or dec(), while the class type check would fail completely.

The difference is probably down to when the optimization is applied: because the JITed code knows the most common class(es) at the call site based on profiling, even the coarser (and cheaper) check is going to be effective, since it uses the most common observed type. The C++ code doesn't have this benefit (unless LTO and PGO are used), so the more careful check probably pays off.
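
As a concrete example of the difference, consider a subclass that adds state but overrides nothing (the name here is made up):

class Counter : public Base
{
  public:
    int calls = 0;   // extra member, but ant() and dec() are not overridden
};

// For a Counter*, gcc's check still passes: the vtable slots for ant() and
// dec() still point at Base::ant and Base::dec, so the inlined fast path is
// taken. A Java-style check on the dynamic type alone would see Counter
// rather than Base and fall back to the slow path.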

2In fact, the actual behavior is complex and depends specifically on the exact processor version. In older processors, the branch predictors were actually much more likely to use the default predictor for cold branches (which predicts not-taken for forward branches), since the prediction mechanisms were tied tightly to the IP of the branch.

In more recent predictors (Haswell and newer), the mechanism is more complex, with a hash of several factors being used, so you may often "collide" with other existing state even in the cold case, and so behavior is not as predictable any more. I think it's safe to say you won't usually get more than a 50% mispredict rate for forward, not-taken branches, however. You can find some investigation on this blog and discussion here.

3OK, you caught me. It excludes the case p(x) == p(y), but you can work that through too and it doesn't change anything.

4A binary search on godbolt, as well as Honza's blog, confirms that this was added in gcc 4.9. Prior to that version, the code is compiled in the same way as by clang and icc.

5In particular, the #pragma message shows that the source is all compiled in a single file called example.cpp.

6By statically available I mean the set of LTO-enabled files that are available to gcc when the LTO phase occurs. This excludes at least two important categories of code: (1) code in static or dynamic objects which are linked to the final binary but for which LTO information is not available (pretty much everything you link which isn't part of your compile process) and (2) additional code loaded at runtime, e.g., via dlopen, which LTO can't possibly account for.

7Peter Cordes points out that this "push everything up front" behavior - where every register that needs to be saved on any possible path through a function is pushed immediately on entry - seems to be a pattern in gcc. For functions with very short fast paths, this seems like a big limitation.

Community
  • 1
  • 1
BeeOnRope
  • 60,350
  • 16
  • 207
  • 386
  • 2
    Very nice answer, thanks for posting this. This makes my answer look like the quick / lazy post that it is :P. – Peter Cordes Jun 21 '16 at 02:21
  • re: predictability: In practice, branch directions are not random with a uniform distribution. They often have short repeating patterns, which predictors can latch on to. *This* is why I suggested that peeling one case out of the indirect `call` *might* help. – Peter Cordes Jun 21 '16 at 02:22
  • Yes, of course, but both the indirect BTB and branch direction predictors (on modern x86, again) have such history-aware prediction mechanisms. In _really modern_ x86 (i.e., Haswell and later, not sure on the AMD side) - this is believed to be something like TAGE and iTAGE, but there is no official word. In any case, both do about the same at predicting patterns. The caveat is that there may be fewer resources on the IBTB side of things, which I touched on above. – BeeOnRope Jun 21 '16 at 02:25
  • Re: avoiding the `push`/`pop` on the code path that doesn't need it: gcc always saves/restores all the registers that any path through the function might need. This is one reason why Linux (the kernel) uses `__attribute__((noinline))` for some functions: so the fast-path through a short function doesn't save/restore as many registers, leaving that to a separate function that handles the less-common case. `push` and `pop` are only single-byte instructions, and decode to a single fused-domain uop on modern CPUs, but it's still nice to avoid them. gcc 5.3 makes tighter code than 6.1, with ebp – Peter Cordes Jun 21 '16 at 02:25
  • re: patterns: remember that most of gcc's optimization passes are *not* x86-specific. They have tuning knobs that are set differently for different targets, but just setting weights / costs doesn't always tune perfectly for `-march=haswell`. Many CPUs (including Atom, IIRC) have weak indirect branch predictors that don't do well with patterns. So I suspect that gcc being over-eager to speculatively devirtualize is a sign of gcc's age / heritage. Oh, or maybe not if this was only added in 4.9. Still, it might help more for ARM than for x86, and maybe isn't tuned perfectly yet for x86. – Peter Cordes Jun 21 '16 at 02:29
  • I think your text doesn't match your code in one spot: ... replacing `(5 + x) + (5 + x)` with `(10 + 4x)`. First, that should be 2x. More importantly, your code is *subtracting* the two function results, which is why even functions 2 and 3 can return 0. (At first I thought function3 was returning 0 for the uninteresting reason that it just modified a local, but the local is just to get an instance for the virtual function). You're also missing a `\`` on a `dec` near there. Anyway, very interesting observation that gcc fails to optimize between the calls. – Peter Cordes Jun 21 '16 at 02:41
  • Also, `xor eax,eax` isn't free. It still takes a fused-domain uop, and 2 bytes of code-size. It also takes an execution port on CPUs other than Intel SnB-family. Even a `nop` isn't free; execution ports aren't the only thing that modern CPUs bottleneck on. (In fact, modern Intel CPUs rarely bottleneck on execution ports in code with a good mix of instructions (because they have so many ALUs); more often loops bottleneck on the frontend or the 4 fused-domain uops per clock pipeline width, in code that doesn't bottleneck on latency, memory, or branch mispredicts.) – Peter Cordes Jun 21 '16 at 02:49
  • @PeterCordes - huh, well it seems like that's a really bad idea. It's doing a ton of work up-front that it doesn't need in the fast path, and then undoing all that (the `pop`s) in the fast-path too. Based on a few tests it seems like gcc does do this (pushing up front) in general. Worse, it doesn't even use any of the pushed registers in the fast path, and their use in the slow path is basically just to save registers that might otherwise be clobbered (i.e., they are scratch in the ABI). Seems like a big loss. Oddly, icc and clang seem to behave pretty similarly. – BeeOnRope Jun 21 '16 at 02:50
  • re: function layout: I forgot to say this earlier: I think stack-unwinding / exception handling is part of the reason for this. compilers have to record changes to the size of the stack frame in the `.eh_frame_hdr` section, so exception handlers and debuggers can unwind the stack, given only the current values of RIP and RSP. (This is what makes `-fomit-frame-pointer` possible.) IDK if smarter register save/restore placement would maybe bloat that metadata, or if it's just a good optimization that gcc doesn't know how to do. Inlining usually creates large-ish functions where it's ok. – Peter Cordes Jun 21 '16 at 02:56
  • @PeterCordes - good catch on the `(10 + 4x)`. I had edited the example several times to try to coax gcc into a better optmization, but didn't update the text. I am updating the text. – BeeOnRope Jun 21 '16 at 02:58
  • Also, mixing `push`/`pop` with `[rsp + disp]` addressing modes to access locals will cause extra stack-sync uops from the stack engine. This is a big deal in 32bit code where the ABI passes everything on the stack, so it makes sense to get all the `push`es out of the way before loading args. Although I think gcc does tend to interleave the prologue in with the first few instructions of the function, so insns on the critical path can issue sooner. IDK how well-tuned this part of gcc is. gcc can be pretty dumb about using more regs than needed (and thus saving/restoring them). – Peter Cordes Jun 21 '16 at 03:00
  • @PeterCordes - regarding "remember that most of gcc's optimization passes are not x86-specific" - very good point. However, I think the idea that the decision to inline being a result of careful evaluation of the potential costs of branch mispredictions in the conditional branch case vs the indirect branch case is giving it a bit too much credit. Most decisions like this are made with heuristics, in the pass where they can apply, and usually only use limited and local machine model information. I would be surprised if it is even _possiblle_ for the the arch model to change this decision. – BeeOnRope Jun 21 '16 at 03:08
  • Right regarding the stack sync uop, but my claim here is the entire block of pops can be deferred, at a large runtime and code size benefit. Also, I don't even see any accesses using `[rsp]` - but I scanned it quickly. It seems like there is a lot of low-hanging fruit here. Of course, what seems low-hanging to me is probably very difficult to implement in the compiler, often because of the walls between various phases (and the cost of introducing a re-run of a particular phase). – BeeOnRope Jun 21 '16 at 03:11
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/115162/discussion-between-beeonrope-and-peter-cordes). – BeeOnRope Jun 21 '16 at 03:14
2

gcc assumes that two predictable branches are cheaper than two indirect calls. That's certainly true if it turns out that it guessed right and the calls it inlined are the ones that should happen. Profile-guided optimization (-fprofile-generate / -fprofile-use) will probably detect if this is not the case, and speculatively inline something else if appropriate. (Or not, maybe it's not that smart).

Indirect branches are potentially quite expensive, since the branch-predictor has to correctly predict the target address. Two not-taken branches are very cheap if they're predictable. I'm a bit surprised that gcc doesn't check that both addresses are the base-class, and not touch i at all in that case (since the inc and dec cancel out.)
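
Hand-written, that "check both and skip the work entirely" idea would look roughly like this (a sketch only, with an invented function name; a dynamic-type check stands in for comparing both function addresses):

#include <typeinfo>

void function_skip(Base * base)
{
    // If the dynamic type is exactly Base, ant() followed by dec() is a
    // no-op (i++ then i--), so skip both calls entirely.
    if (typeid(*base) == typeid(Base))
        return;
    base->ant();
    base->dec();
}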


Modern CPUs are quite good at blowing through a lot of instructions quickly, when there's a decent amount of instruction-level parallelism like in this case. See Agner Fog's microarch guide, and other links in the tag wiki.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • Thanks Peter - That helps. But surely it's incurred most of the expense of the indirect call anyway though by looking up the address of ant and dec in order to check if they've changed? – JCx Jun 18 '16 at 18:02
  • 2
    It depends. The indirect call has at least two costs: (1) mispredicted target (2) cache miss while accessing the vtable ptr for `base`. Certainly if (2) was actually a problem, it would be incurred by both approaches, since both access the vtable. However, vtable cache misses are unlikely to be a large source of slowdown in a real program, since you'd need a huge number of different, sparsely accessed classes for it to be an issue. So you are left mostly with (1) - the penalty for a mispredicted indirect branch. That's something like 15 cycles and something the gcc solution doesn't incur. – BeeOnRope Jun 19 '16 at 01:52
  • 2
    @PeterCordes - regarding " I'm a bit surprised that gcc doesn't..." - that is actually a fairly tough optimization to make in general. It requires an additional comparison and branch, to save two 1 cycle instructions. It would fail badly if this function was mostly passed objects with one function overridden. More practically though, these kinds of virtual-call-site optimizations are generally local things (i.e., with the scope of that one call) as it is tough to extend them across calls. – BeeOnRope Jun 19 '16 at 01:57
  • 2
    Imagine, for example, a function with more virtual calls, and with a few additional pointers passed as arguments. There is a combinatorial explosion of relationships between all the possible overloads, and various opportunities to optimize them together. You'd have to have very strong hints from PGO to actually take any of them though, since the cost quickly exceeds the "dumb" approaches of making the call. Consider also that many function pairs don't have many useful additional optimization opportunities anyway - the `dec` cancelling `inc` example is a bit of a perfect case. – BeeOnRope Jun 19 '16 at 02:01
  • 1
    @JCx: BeeOnRope didn't @-tag you in his first comment. I don't have much to add to that, except a summary: a memory-indirect call isn't expensive because of having to load the address from cache, it's expensive because it's harder for the branch-predictor to correctly predict (which has to happen way earlier in the pipeline than the actual data fetch). So it seems reasonable to guess that a normal branch on an address compare is likely to have smaller misprediction penalties on average than an indirect call. – Peter Cordes Jun 19 '16 at 04:25
  • 1
    In cases where indirect call *does* happen, there's one fewer potential target, so less chance of a hard-to-predict pattern of target addresses. (e.g. if it's always either the base class or one derived class, then the target address of the `call rdx` is always the derived class. Even the simplest branch-target-buffer can predict that well, if it's in the BTB at all.) – Peter Cordes Jun 19 '16 at 04:27
  • @BeeOnRope: good point about why gcc probably didn't even look at cross-optimizing between the functions. It's probably mostly just made-up cases like this where the extra branch might be worth it. – Peter Cordes Jun 19 '16 at 04:29
  • 1
    It is equally as interesting that other compilers (e.g., Clang and ICC) don't make the same assumption as GCC. I certainly wouldn't have were I writing this without taking the time to benchmark. A mispredicted indirect call has a penalty of ~15-30 cycles, which is about the same amount as you'd expect for a mispredicted branch. In repeated calls, both types of code would benefit from branch prediction, and I'd say that the significant code *size* savings from the indirect calls would ultimately be the saving grace. – Cody Gray - on strike Jun 19 '16 at 17:44
  • 1
    @CodyGray: Yeah, I'm not convinced this is an obviously-good optimization. IDK how common it is for people to make virtual functions but then have the base-class definition run most of the time. `gcc -O3` is very aggressive about bloating code-size. Maybe this is a good idea; if this code *is* part of a hot loop, then this might help. If it's not part of a hot loop, then it's just one more bloated piece of code that runs occasionally. (gcc autovectorization is particularly bad about bloating code by fully unrolling the unaligned cleanup loops.) – Peter Cordes Jun 19 '16 at 19:49
  • 1
    This is a reasonable target for link-time codegen (LTGC), at least when an executable is being produced. At that point, the compiler can check if an override of Base exists in any compilation unit, and then the optimistic inlining would be more justifiable. Of course, even in that scenario you still need the fallback path because Base could be overridden, e.g. in a dynamically loaded shared object (or a statically linked library for which LTGC is not available). – BeeOnRope Jun 19 '16 at 21:54
  • @PeterCordes - I commented on the idea that this optimization helps the subsequent indirect call in my answer. It does reduce the number of targets, but the misprediction cost for the branches seems to more than cancel it out, in general. About `-O3`, it seems that gcc performs this for both `-O2` and `-O3`, but not `-O1`. – BeeOnRope Jun 21 '16 at 01:36