9

If I call a virtual function 1000 times in a loop, will I suffer from the vtable lookup overhead 1000 times or only once?

fulmicoton
  • 2
    I'd be more concerned about cache misses and badly predicted jump targets than about the time needed for the lookup per se. And those problems depend a lot on the precise patterns of previous accesses. The standard response about micro-performance comparisons used to be "measure"; now it is "measure, and make sure that what you measure is really characteristic of deployment". You'll also get dependence on the precise model of processor, what the other core(s) are running, SMT effects, ... – AProgrammer Aug 18 '09 at 09:23

7 Answers

8

The compiler may be able to optimise it - for example, the following is (at least conceptually) easily optimised:

Foo * f = new Foo;
for ( int i = 0; i < 1000; i++ ) {
   f->func();
}

However, other cases are more difficult:

vector <Foo *> v;
// populate v with 1000 Foo (not derived) objects
for ( int i = 0; i < v.size(); i++ ) {
   v[i]->func();
}

the same conceptual optimisation is applicable, but much harder for the compiler to see.

Bottom line - if you really care about it, compile your code with all optimisations enabled and examine the compiler's assembler output.
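One way to make the harder vector case visible to the optimizer is C++11's `final` (which postdates this question). A sketch with hypothetical types: when the compiler can prove no further override exists, the call can be direct-dispatched or inlined.

```cpp
#include <vector>

struct Foo {
    virtual ~Foo() = default;
    virtual int func() const { return 1; }
};

// Hypothetical derived class marked final: the compiler knows func()
// cannot be overridden further, so calls through a SealedFoo* can be
// direct-dispatched (and potentially inlined) inside the loop.
struct SealedFoo final : Foo {
    int func() const override { return 2; }
};

int sum(const std::vector<SealedFoo*>& v) {
    int total = 0;
    for (const SealedFoo* p : v)
        total += p->func();   // devirtualizable: SealedFoo is final
    return total;
}
```

Whether the compiler actually takes advantage of this is, as above, something to check in the assembler output.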

6

The Visual C++ compiler (at least through VS 2008) does not cache vtable lookups. Even more interestingly, it doesn't direct-dispatch calls to virtual methods where the static type of the object is sealed. However, the actual overhead of the virtual dispatch lookup is almost always negligible. The place where you sometimes do see a hit is in the fact that virtual calls in C++ cannot be replaced by direct calls like they can in a managed VM. This also means no inlining for virtual calls.

The only true way to establish the impact for your application is using a profiler.

Regarding the specifics of your original question: if the virtual method you are calling is trivial enough that the virtual dispatch itself is incurring a measurable performance impact, then that method is sufficiently small that the vtable will remain in the processor's cache throughout the loop. Even though the assembly instructions to pull the function pointer from the vtable are executed 1000 times, the performance impact will be much less than (1000 * time to load vtable from system memory).
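A minimal profiling harness along these lines might look as follows (the type names, the trivial `step` method, and the 1000-iteration count are all illustrative, not from the original post):

```cpp
#include <chrono>
#include <cstdio>

struct Base {
    virtual ~Base() = default;
    virtual long step(long x) { return x + 1; }
};

// Time 1000 virtual calls through a base pointer. After the first
// iteration the vtable is hot in cache, so the measured cost is
// dominated by the indirect call itself, not by fetching the vtable
// from system memory.
long time_virtual_calls(Base* b) {
    auto t0 = std::chrono::steady_clock::now();
    long acc = 0;
    for (int i = 0; i < 1000; ++i)
        acc = b->step(acc);
    auto t1 = std::chrono::steady_clock::now();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0);
    std::printf("1000 virtual calls: %lld ns\n", (long long)ns.count());
    return acc;   // using the result keeps the loop from being optimized away
}
```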

Sam Harwell
3

If the compiler can deduce that the object on which you're calling the virtual function doesn't change, then, in theory, it should be able to hoist the vtable lookup out of the loop.

Whether your particular compiler actually does this is something you can only find out by looking at the assembly code it produces.

Martin B
  • ... or by profiling. Write code that does 1000 lookups and compare. – Tadeusz A. Kadłubowski Aug 18 '09 at 09:07
  • But what would you compare it against? You can't compare it against a non-virtual function because that calls an absolute address as opposed to an indirect address. You also can't compare it against code that calls 1000 different objects because a) you have to get the addresses of those objects from somewhere, which takes extra time, and b) calling 1000 different objects is much less cache-friendly, so we would expect it to be slower anyway. – Martin B Aug 18 '09 at 09:19
  • 2
    Compare against a volatile Foo*. Note that you first have to check 1000 non-virtual calls to see how much overhead you incur by reloading `this` on every call. Then, compare volatile and non-volatile Foo*'s over 1000 virtual calls to see how much _additional_ overhead the vtable lookup incurs. – MSalters Aug 18 '09 at 10:46
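MSalters' comparison could be sketched like this (the struct and the trivial `func` are illustrative): making the pointer itself `volatile` forces the compiler to reload it, and hence redo the vtable lookup, on every iteration, so the difference between the two loops isolates the per-call lookup cost.

```cpp
struct Foo {
    virtual ~Foo() = default;
    virtual long func(long x) { return x + 1; }
};

// The pointer (not the pointee) is volatile: it must be re-read on
// every iteration, so the vtable lookup cannot be hoisted.
long run_volatile(Foo* obj) {
    Foo* volatile p = obj;
    long acc = 0;
    for (int i = 0; i < 1000; ++i)
        acc = p->func(acc);
    return acc;
}

// Plain pointer: the compiler is free to hoist the lookup out of
// the loop (whether it does so is compiler-dependent).
long run_plain(Foo* obj) {
    Foo* p = obj;
    long acc = 0;
    for (int i = 0; i < 1000; ++i)
        acc = p->func(acc);
    return acc;
}
```

Timing each function separately (with a profiler or a steady clock) gives the additional overhead MSalters describes.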
1

I think that the problem is not the vtable lookup, since that's a very fast operation, especially in a loop where all the required values are in cache (if the loop is not too complex; and if it is complex, the virtual function won't impact performance much anyway). The real problem is that the compiler cannot inline that function at compile time.

This is especially a problem when the virtual function is very small (e.g. it only returns a value). The relative performance impact in this case can be huge, because you pay for a full function call just to return a value. If the function could be inlined, it would improve performance very much.

If the virtual function itself is expensive, then I wouldn't really worry about the vtable.
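A sketch of the small-function case described above (hypothetical types): through a base pointer, a one-line virtual getter costs an indirect call per access, whereas with a known static type the compiler can inline it down to a constant.

```cpp
struct Shape {
    virtual ~Shape() = default;
    // Tiny virtual function: through a Shape* it cannot be inlined,
    // so every call pays full call overhead just to return a value.
    virtual int sides() const { return 0; }
};

struct Square : Shape {
    int sides() const override { return 4; }
};

// n indirect calls that each just return a constant -- the call
// overhead dominates the work done per call.
int total_sides(const Shape* s, int n) {
    int total = 0;
    for (int i = 0; i < n; ++i)
        total += s->sides();
    return total;
}
```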

Aleksandar
1

For a study of the overhead of virtual function calls, I recommend the paper "The Direct Cost of Virtual Function Calls in C++".

Quonux
1

Let's give it a try with g++ targeting x86:

$ cat y.cpp
struct A
  {
    virtual void not_used(int);
    virtual void f(int);
  };

void foo(A &a)
  {
    for (unsigned i = 0; i < 1000; ++i)
      a.f(13);
  }
$ 
$ gcc -S -O3  y.cpp  # assembler output, max optimization
$ 
$ cat y.s
    .file   "y.cpp"
    .section    .text.unlikely,"ax",@progbits
.LCOLDB0:
    .text
.LHOTB0:
    .p2align 4,,15
    .globl  _Z3fooR1A
    .type   _Z3fooR1A, @function
_Z3fooR1A:
.LFB0:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    pushq   %rbx
    .cfi_def_cfa_offset 24
    .cfi_offset 3, -24
    movq    %rdi, %rbp
    movl    $1000, %ebx
    subq    $8, %rsp
    .cfi_def_cfa_offset 32
    .p2align 4,,10
    .p2align 3
.L2:
    movq    0(%rbp), %rax
    movl    $13, %esi
    movq    %rbp, %rdi
    call    *8(%rax)
    subl    $1, %ebx
    jne .L2
    addq    $8, %rsp
    .cfi_def_cfa_offset 24
    popq    %rbx
    .cfi_def_cfa_offset 16
    popq    %rbp
    .cfi_def_cfa_offset 8
    ret
    .cfi_endproc
.LFE0:
    .size   _Z3fooR1A, .-_Z3fooR1A
    .section    .text.unlikely
.LCOLDE0:
    .text
.LHOTE0:
    .ident  "GCC: (GNU) 5.3.1 20160406 (Red Hat 5.3.1-6)"
    .section    .note.GNU-stack,"",@progbits
$

The L2 label is the top of the loop. The line right after L2 seems to be loading the vpointer into rax. The call 4 lines after L2 seems to be indirect, fetching the pointer to the f() override from the vstruct.

I'm surprised by this. I would have expected the compiler to treat the address of the f() override function as a loop invariant. It seems like gcc is making two "paranoid" assumptions:

  1. The f() override function may change the hidden vpointer in the object somehow, or
  2. The f() override function may change the contents of the vstruct somehow.
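Assumption 1 is less far-fetched than it sounds: a virtual function may legally end the lifetime of `*this` and placement-new a different type into the same storage, changing the vpointer mid-loop. A contrived sketch with hypothetical types (uses C++17's `std::launder` to revisit the storage afterwards):

```cpp
#include <new>

struct Chameleon {
    virtual ~Chameleon() = default;
    virtual int id() { return 0; }
    virtual void f();
};

struct Swapped : Chameleon {
    int id() override { return 1; }
    void f() override {}
};

// f() destroys *this and constructs a Swapped in the same storage:
// the hidden vpointer in the object really does change during the
// call, which is exactly what gcc's per-iteration reload guards against.
void Chameleon::f() {
    this->~Chameleon();
    new (this) Swapped;
}
```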

Edit: In a separate compilation unit, I implemented A::f() and a main function with a call to foo(). I then built an executable with gcc using link-time optimization, and ran objdump on it. The virtual function call was inlined. So, perhaps this is why gcc optimization without LTO is not as ideal as one might expect.

WaltK
0

I would say this depends on your compiler as well as on the shape of the loop. Optimizing compilers can do a lot for you, and if the virtual call is predictable the compiler can help you. You may find something about the optimizations your compiler performs in its documentation.

ManniAT