You're understanding the effects of `final` correctly (except maybe for the inner loop of Case 2), but your cost estimates are way off. We shouldn't expect a big effect anywhere, because `mt19937` is just plain slow and all 3 versions spend most of their time in it.
The only thing that's not lost in noise / buried in overhead is the effect of inlining `int run_once() override final` into the inner loop in `FooPlus::run_multiple`, which both Case 2 and Case 3 run. But Case 1 can't inline `Foo::run_once()` into `Foo::run_multiple()`, so there's function-call overhead inside the inner loop, unlike the other 2 cases.
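(For reference, here's a minimal sketch of the kind of hierarchy I'm talking about. The class and function names come from your question; the bodies, seed, and distribution parameters are just guesses to make it compile.)

```c++
#include <random>

class Foo {
public:
    virtual ~Foo() = default;
    virtual int run_once() {                // virtual and not final
        return dist(rng);                   // this is where most of the time goes
    }
    virtual int run_multiple(int n) {       // inner loop: n = 100 in the benchmark
        int sum = 0;
        for (int i = 0; i < n; ++i)
            sum += run_once();              // virtual call: Case 1 can't inline this
        return sum;
    }
protected:
    std::mt19937 rng{42};                            // seed is arbitrary for this sketch
    std::uniform_int_distribution<int> dist{0, 100}; // range is a guess
};

class FooPlus : public Foo {
public:
    int run_once() override final {         // final: no further override possible
        return dist(rng);
    }
    int run_multiple(int n) override {      // run_once() devirtualizes and inlines here
        int sum = 0;
        for (int i = 0; i < n; ++i)
            sum += run_once();
        return sum;
    }
};
```

The only structural difference that matters is whether the `run_once()` call inside `run_multiple` can be resolved at compile time.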
Case 2 has to call `run_multiple` repeatedly, but that's only once per 100 runs of `run_once` and has no measurable effect.
For all 3 cases, most of the time is spent in `dist(rng);`, because `std::mt19937` is pretty slow compared to the extra overhead of not inlining a function call. Out-of-order execution can probably hide a lot of that overhead, too, but not all of it, so there's still something to measure.
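To be concrete, here's roughly how I'm reading the three cases. The real code is behind your QuickBench link, so treat this as a hedged reconstruction (the function names here are mine); all that matters is what type information the compiler can see at each call site.

```c++
#include <benchmark/benchmark.h>

// Hedged reconstruction of the three cases, using the Foo / FooPlus sketch above.
static void Case1(benchmark::State& state) {
    Foo* f = new Foo();                    // base object behind a base pointer
    for (auto _ : state) {
        int x = f->run_multiple(100);      // indirect call; run_once() not inlined inside
        benchmark::DoNotOptimize(x);
    }
    delete f;
}
static void Case2(benchmark::State& state) {
    Foo* f = new FooPlus();                // derived object behind a base pointer
    for (auto _ : state) {
        int x = f->run_multiple(100);      // indirect call, but run_once() inlines inside
        benchmark::DoNotOptimize(x);
    }
    delete f;
}
static void Case3(benchmark::State& state) {
    FooPlus f;                             // concrete type visible to the optimizer
    for (auto _ : state) {
        int x = f.run_multiple(100);       // everything can inline, as in the asm below
        benchmark::DoNotOptimize(x);
    }
}
BENCHMARK(Case1);
BENCHMARK(Case2);
BENCHMARK(Case3);
BENCHMARK_MAIN();
```

If your actual Case 3 goes through a `FooPlus*` instead of a plain object, the conclusion is the same as long as the compiler can prove the call target.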
Case 3 is able to inline everything, down to this asm loop (from your QuickBench link):
# percentages are *self* time, not including time spent in the PRNG
# These are from QuickBench's perf report tab,
# presumably samples of the core clock-cycles perf event.
# Take them with a grain of salt: superscalar + out-of-order exec
# makes it hard to blame one instruction for a clock cycle
VirtualWithFinalCase2(benchmark::State&): # case 3 from QuickBench link
... setup before the loop
.p2align 3
.Louter: # do{
xor %ebp,%ebp # sum = 0
mov $0x64,%ebx # inner = 100
.p2align 3 # nopw 0x0(%rax,%rax,1)
.Linner: # do {
51.82% mov %r13,%rdi
mov %r15,%rsi
mov %r13,%rdx # copy args from call-preserved regs
callq 404d60 # mt PRNG for unsigned long
47.27% add %eax,%ebp # sum += run_once()
add $0xffffffff,%ebx # --inner
jne .Linner # }while(inner);
mov %ebp,0x4(%rsp) # store to volatile local: benchmark::DoNotOptimize(x);
0.91% add $0xffffffffffffffff,%r12 # --outer
jne .Louter # } while(outer)
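In source terms, that asm is doing roughly this (a sketch of the post-inlining structure; `rng`/`dist` stand for the object's PRNG state and `iterations` for the benchmark's iteration count, with the same headers as the sketches above):

```c++
// What the Case 3 asm boils down to: both virtual calls are gone after inlining,
// and only the out-of-line mt19937 call survives in the inner loop.
void case3_after_inlining(long iterations, std::mt19937& rng,
                          std::uniform_int_distribution<int>& dist) {
    for (long outer = iterations; outer != 0; --outer) {  // .Louter: do{
        int sum = 0;                                       // xor %ebp,%ebp
        for (int inner = 100; inner != 0; --inner)         // mov $0x64,%ebx / .Linner
            sum += dist(rng);                              // callq: the PRNG itself
        benchmark::DoNotOptimize(sum);                     // store to the volatile local
    }                                                      // }while(--outer);
}
```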
Case 2 can still inline `run_once` into `run_multiple` because `class FooPlus` uses `int run_once() override final`. There's virtual-dispatch overhead in the outer loop (only), but this small extra cost per outer-loop iteration is totally dwarfed by the cost of the inner loop (which is identical between Case 2 and Case 3).
So the inner loops are essentially identical, with indirect-call overhead only in the outer loop. It's unsurprising that this is unmeasurable, or at least lost in noise, on QuickBench.
Case 1 can't inline `Foo::run_once()` into `Foo::run_multiple()`, so there's function-call overhead in its inner loop, too. (The fact that it's an indirect function call is relatively minor; in a tight loop, branch prediction will do a near-perfect job.)
Case 1 and Case 2 have identical asm for their outer loops, if you look at the disassembly on your QuickBench link. Neither one can devirtualize and inline `run_multiple`: Case 1 because it's virtual and not `final`, Case 2 because the static type is only the base class, not the derived class with a `final` override.
# case 2 and case 1 *outer* loops
.loop: # do {
mov (%r15),%rax # load vtable pointer
mov $0x64,%esi # first C++ arg
mov %r15,%rdi # this pointer = hidden first arg
callq *0x8(%rax) # memory-indirect call through a vtable entry
mov %eax,0x4(%rsp) # store the return value to a `volatile` local
add $0xffffffffffffffff,%rbx
jne 4049f0 .loop # } while(--i != 0);
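Mapping that back to source, the outer loop is just this (sketch; `f` and the iteration count as in the case reconstruction above):

```c++
// The Case 1 / Case 2 outer loop: run_multiple stays an opaque indirect call,
// so this skeleton is all that gets inlined into the benchmark function.
void outer_loop_as_compiled(Foo* f, long iterations) {
    for (long i = iterations; i != 0; --i) {  // add $0xff...,%rbx / jne
        int x = f->run_multiple(100);         // mov (%r15),%rax: load vtable ptr
                                              // mov $0x64,%esi:  n = 100
                                              // mov %r15,%rdi:   this
                                              // callq *0x8(%rax)
        benchmark::DoNotOptimize(x);          // mov %eax,0x4(%rsp): volatile store
    }
}
```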
This is probably a missed optimization: the compiler can prove that `Base *f` came from `new FooPlus()`, and thus points to an object whose dynamic type is known at compile time to be `FooPlus`. `operator new` can be replaced, but the compiler still emits a separate call to `FooPlus::FooPlus()` (passing it a pointer to the storage from `new`), so the object's type is knowable regardless. This just seems to be a case of clang not taking advantage of that in Case 2, and maybe also Case 1.
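If you wanted to take this out of the optimizer's hands, you could hand it the derived type yourself. An untested sketch, reusing the `Foo`/`FooPlus` names from above (the benchmark name here is hypothetical):

```c++
// Sketch (untested): two ways to give the optimizer the type information it
// isn't deriving from `new FooPlus()` on its own here.
static void Case2Devirtualized(benchmark::State& state) {
    FooPlus concrete;                     // 1) automatic object: dynamic type is obvious
    Foo* f = new FooPlus();
    auto* fp = static_cast<FooPlus*>(f);  // 2) assert the dynamic type yourself
    for (auto _ : state) {
        int a = concrete.run_multiple(100);  // direct call: can always inline fully
        int b = fp->run_multiple(100);       // devirtualizable if run_multiple (or the
                                             // whole class FooPlus) is declared final
        benchmark::DoNotOptimize(a);
        benchmark::DoNotOptimize(b);
    }
    delete f;
}
BENCHMARK(Case2Devirtualized);
```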