You're understanding the effects of `final` correctly (except maybe for the inner loop of Case 2), but your cost estimates are way off. We shouldn't expect a big effect anywhere, because `mt19937` is just plain slow and all 3 versions spend most of their time in it.
The only thing that's not lost in noise / buried in overhead is the effect of inlining `int run_once() override final` into the inner loop in `FooPlus::run_multiple`, which both Case 2 and Case 3 run. But Case 1 can't inline `Foo::run_once()` into `Foo::run_multiple()`, so there's function-call overhead inside the inner loop, unlike the other 2 cases.
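(For reference, here's a minimal sketch of the kind of hierarchy I'm talking about. The class and function names come from your question; the bodies, seed, and distribution parameters are just guesses to make it compile.)

```c++
#include <random>

class Foo {
public:
    virtual ~Foo() = default;
    virtual int run_once() {                // virtual and not final
        return dist(rng);                   // this is where most of the time goes
    }
    virtual int run_multiple(int n) {       // inner loop: n = 100 in the benchmark
        int sum = 0;
        for (int i = 0; i < n; ++i)
            sum += run_once();              // virtual call: Case 1 can't inline this
        return sum;
    }
protected:
    std::mt19937 rng{42};                            // seed is arbitrary for this sketch
    std::uniform_int_distribution<int> dist{0, 100}; // range is a guess
};

class FooPlus : public Foo {
public:
    int run_once() override final {         // final: no further override possible
        return dist(rng);
    }
    int run_multiple(int n) override {      // run_once() devirtualizes and inlines here
        int sum = 0;
        for (int i = 0; i < n; ++i)
            sum += run_once();
        return sum;
    }
};
```

The only structural difference that matters is whether the `run_once()` call inside `run_multiple` can be resolved at compile time.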
Case 2 has to call `run_multiple` repeatedly, but that's only once per 100 runs of `run_once` and has no measurable effect.
For all 3 cases, most of the time is spent in `dist(rng);`, because `std::mt19937` is pretty slow compared to the extra overhead of not inlining a function call. Out-of-order execution can probably hide a lot of that overhead, too, but not all of it, so there's still something to measure.
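To be concrete, here's roughly how I'm reading the three cases. The real code is behind your QuickBench link, so treat this as a hedged reconstruction (the function names here are mine); all that matters is what type information the compiler can see at each call site.

```c++
#include <benchmark/benchmark.h>

// Hedged reconstruction of the three cases, using the Foo / FooPlus sketch above.
static void Case1(benchmark::State& state) {
    Foo* f = new Foo();                    // base object behind a base pointer
    for (auto _ : state) {
        int x = f->run_multiple(100);      // indirect call; run_once() not inlined inside
        benchmark::DoNotOptimize(x);
    }
    delete f;
}
static void Case2(benchmark::State& state) {
    Foo* f = new FooPlus();                // derived object behind a base pointer
    for (auto _ : state) {
        int x = f->run_multiple(100);      // indirect call, but run_once() inlines inside
        benchmark::DoNotOptimize(x);
    }
    delete f;
}
static void Case3(benchmark::State& state) {
    FooPlus f;                             // concrete type visible to the optimizer
    for (auto _ : state) {
        int x = f.run_multiple(100);       // everything can inline, as in the asm below
        benchmark::DoNotOptimize(x);
    }
}
BENCHMARK(Case1);
BENCHMARK(Case2);
BENCHMARK(Case3);
BENCHMARK_MAIN();
```

If your actual Case 3 goes through a `FooPlus*` instead of a plain object, the conclusion is the same as long as the compiler can prove the call target.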
Case 3 is able to inline everything, down to this asm loop (from your QuickBench link):
# percentages are *self* time, not including time spent in the PRNG
# These are from QuickBench's perf report tab,
# presumably samples of the core clock-cycles perf event.
# Take them with a grain of salt: superscalar + out-of-order exec
# makes it hard to blame one instruction for a clock cycle
VirtualWithFinalCase2(benchmark::State&): # case 3 from QuickBench link
... setup before the loop
.p2align 3
.Louter: # do{
xor %ebp,%ebp # sum = 0
mov $0x64,%ebx # inner = 100
.p2align 3 # nopw 0x0(%rax,%rax,1)
.Linner: # do {
51.82% mov %r13,%rdi
mov %r15,%rsi
mov %r13,%rdx # copy args from call-preserved regs
callq 404d60 # mt PRNG for unsigned long
47.27% add %eax,%ebp # sum += run_once()
add $0xffffffff,%ebx # --inner
jne .Linner # }while(inner);
mov %ebp,0x4(%rsp) # store to volatile local: benchmark::DoNotOptimize(x);
0.91% add $0xffffffffffffffff,%r12 # --outer
jne .Louter # } while(outer)
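In source terms, that asm is doing roughly this (a sketch of the post-inlining structure; `rng`/`dist` stand for the object's PRNG state and `iterations` for the benchmark's iteration count, with the same headers as the sketches above):

```c++
// What the Case 3 asm boils down to: both virtual calls are gone after inlining,
// and only the out-of-line mt19937 call survives in the inner loop.
void case3_after_inlining(long iterations, std::mt19937& rng,
                          std::uniform_int_distribution<int>& dist) {
    for (long outer = iterations; outer != 0; --outer) {  // .Louter: do{
        int sum = 0;                                       // xor %ebp,%ebp
        for (int inner = 100; inner != 0; --inner)         // mov $0x64,%ebx / .Linner
            sum += dist(rng);                              // callq: the PRNG itself
        benchmark::DoNotOptimize(sum);                     // store to the volatile local
    }                                                      // }while(--outer);
}
```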
Case 2 can still inline `run_once` into `run_multiple` because `class FooPlus` uses `int run_once() override final`. There's virtual-dispatch overhead in the outer loop (only), but this small extra cost per outer-loop iteration is totally dwarfed by the cost of the inner loop (which is identical between Case 2 and Case 3).
So the inner loops are essentially identical, with indirect-call overhead only in the outer loop. It's unsurprising that this is unmeasurable, or at least lost in noise, on QuickBench.
Case 1 can't inline `Foo::run_once()` into `Foo::run_multiple()`, so there's function-call overhead in its inner loop, too. (The fact that it's an indirect function call is relatively minor; in a tight loop, branch prediction will do a near-perfect job.)
Case 1 and Case 2 have identical asm for their outer loops, if you look at the disassembly on your QuickBench link. Neither one can devirtualize and inline `run_multiple`: Case 1 because it's virtual and not `final`, Case 2 because the static type is only the base class, not the derived class with a `final` override.
# case 2 and case 1 *outer* loops
.loop: # do {
mov (%r15),%rax # load vtable pointer
mov $0x64,%esi # first C++ arg
mov %r15,%rdi # this pointer = hidden first arg
callq *0x8(%rax) # memory-indirect call through a vtable entry
mov %eax,0x4(%rsp) # store the return value to a `volatile` local
add $0xffffffffffffffff,%rbx
jne 4049f0 .loop # } while(--i != 0);
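Mapping that back to source, the outer loop is just this (sketch; `f` and the iteration count as in the case reconstruction above):

```c++
// The Case 1 / Case 2 outer loop: run_multiple stays an opaque indirect call,
// so this skeleton is all that gets inlined into the benchmark function.
void outer_loop_as_compiled(Foo* f, long iterations) {
    for (long i = iterations; i != 0; --i) {  // add $0xff...,%rbx / jne
        int x = f->run_multiple(100);         // mov (%r15),%rax: load vtable ptr
                                              // mov $0x64,%esi:  n = 100
                                              // mov %r15,%rdi:   this
                                              // callq *0x8(%rax)
        benchmark::DoNotOptimize(x);          // mov %eax,0x4(%rsp): volatile store
    }
}
```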
This is probably a missed optimization: the compiler can prove that `Base *f` came from `new FooPlus()`, and thus points to an object whose dynamic type is known at compile time to be `FooPlus`. `operator new` can be replaced, but the compiler still emits a separate call to `FooPlus::FooPlus()` (passing it a pointer to the storage from `new`), so the object's type is knowable regardless. This just seems to be a case of clang not taking advantage of that in Case 2, and maybe also Case 1.
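If you wanted to take this out of the optimizer's hands, you could hand it the derived type yourself. An untested sketch, reusing the `Foo`/`FooPlus` names from above (the benchmark name here is hypothetical):

```c++
// Sketch (untested): two ways to give the optimizer the type information it
// isn't deriving from `new FooPlus()` on its own here.
static void Case2Devirtualized(benchmark::State& state) {
    FooPlus concrete;                     // 1) automatic object: dynamic type is obvious
    Foo* f = new FooPlus();
    auto* fp = static_cast<FooPlus*>(f);  // 2) assert the dynamic type yourself
    for (auto _ : state) {
        int a = concrete.run_multiple(100);  // direct call: can always inline fully
        int b = fp->run_multiple(100);       // devirtualizable if run_multiple (or the
                                             // whole class FooPlus) is declared final
        benchmark::DoNotOptimize(a);
        benchmark::DoNotOptimize(b);
    }
    delete f;
}
BENCHMARK(Case2Devirtualized);
```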