
EDIT: At the request of n.m., I have included the complete code I was using despite its verbosity.

I've written a short example program that I was using to study the overhead of virtual calls.

timer.h:

#pragma once

#include <chrono>
#include <cstdint>

class Timer {
    typedef std::chrono::high_resolution_clock clock;
    typedef clock::time_point time_point;
    typedef clock::duration duration;

    time_point begin;
    time_point end;
public:
    void start()
    {
        begin = clock::now();
    }

    void stop()
    {
        end = clock::now();
    }

    double as_seconds()
    {
        duration dur = end - begin;

        return (double) dur.count() * (double) duration::period::num / (double) duration::period::den;
    }
};

virtual.h:

#pragma once

static const int NUM_DERIVED = 10;

struct Base {
    int val;

#ifdef NO_VIRTUAL
  #define virtual
#endif
    virtual void f();
    virtual ~Base();
#undef virtual
};

Base *create_base(unsigned max_derived);

virtual.cpp:

#include <algorithm>
#include <iostream>
#include <memory>
#include <typeinfo>
#include <vector>
#include "virtual.h"
#include "timer.h"

template <typename Container>
void measure_call(const Container &c)
{
    Timer t;

    t.start();
    for (auto &ptr : c) {
        ptr->f();
    }
    t.stop();

    std::cout << "Virtual calls: " << c.size() << '\n';
    std::cout << "Elapsed time (s): " << t.as_seconds() << '\n';
}

int main()
{
    typedef std::unique_ptr<Base> base_ptr;
    typedef std::vector<base_ptr> base_vector;

    const int NUM_VEC = 10000000;
    const int NUM_DERIVED_USED = NUM_DERIVED;

    base_vector vec1;
    base_vector vec2;

    Timer t;

    vec1.reserve(NUM_VEC);
    vec2.reserve(NUM_VEC);
    for (int i = 0; i < NUM_VEC; ++i) {
        Base *b1 = create_base(1);
        Base *b2 = create_base(NUM_DERIVED_USED);
        vec1.emplace_back(b1);
        vec2.emplace_back(b2);
    }

    std::cout << "Measuring random vector of " << 1 << " derived classes\n";
    measure_call(vec1);
    std::cout << '\n';

    std::cout << "Measuring random vector of " << NUM_DERIVED_USED << " derived classes\n";
    measure_call(vec2);
    std::cout << '\n';

    std::cout << "Sorted vector of " << NUM_DERIVED << " derived clases\n";
    std::sort(
        vec2.begin(), vec2.end(),
        [](const base_ptr &a, const base_ptr &b) -> bool
        {
            return typeid(*a).hash_code() < typeid(*b).hash_code();
        }
    );
    for (int i = 0; i < 20; ++i) {
        int idx = vec2.size() / 20 * i;
        std::cout << "vector[" << idx << "]: " << typeid(*vec2[idx]).name() << '\n';
    }
    std::cout << '\n';

    std::cout << "Measuring sorted vector of " << NUM_DERIVED_USED << " derived classes\n";
    measure_call(vec2);

    return 0;
}

virtual_impl.cpp:

#include <cstdlib>
#include "virtual.h"

template <int N>
struct Derived : Base {
    void f() { Base::val = N; }
    ~Derived() {}
};

void Base::f() {}

Base::~Base() {}

Base *create_base(unsigned max_derived)
{
    unsigned n = max_derived > NUM_DERIVED ? NUM_DERIVED : max_derived;

    switch (rand() % n) {
    case 9:
        return new Derived<9>;
    case 8:
        return new Derived<8>;
    case 7:
        return new Derived<7>;
    case 6:
        return new Derived<6>;
    case 5:
        return new Derived<5>;
    case 4:
        return new Derived<4>;
    case 3:
        return new Derived<3>;
    case 2:
        return new Derived<2>;
    case 1:
        return new Derived<1>;
    case 0:
        return new Derived<0>;
    default:
        return nullptr;
    }
}

I compile virtual_impl.cpp into a shared library to prevent the compiler from messing around and devirtualizing or inlining things:

$ g++ -shared --std=c++11 -O3 virtual_impl.cpp -o libvirtual_impl.so
$ g++ --std=c++11 -O3 virtual.cpp -L. -lvirtual_impl -o virtual
$ ./virtual

The result I get is:

Measuring random vector of 1 derived classes
Virtual calls: 10000000
Elapsed time (s): 0.039333

Measuring random vector of 10 derived classes
Virtual calls: 10000000
Elapsed time (s): 0.089604

Sorted vector of 10 derived classes
vector[0]: 7DerivedILi8EE
vector[500000]: 7DerivedILi8EE
vector[1000000]: 7DerivedILi8EE
vector[1500000]: 7DerivedILi7EE
vector[2000000]: 7DerivedILi7EE
vector[2500000]: 7DerivedILi2EE
vector[3000000]: 7DerivedILi2EE
vector[3500000]: 7DerivedILi9EE
vector[4000000]: 7DerivedILi9EE
vector[4500000]: 7DerivedILi0EE
vector[5000000]: 7DerivedILi1EE
vector[5500000]: 7DerivedILi1EE
vector[6000000]: 7DerivedILi1EE
vector[6500000]: 7DerivedILi6EE
vector[7000000]: 7DerivedILi6EE
vector[7500000]: 7DerivedILi4EE
vector[8000000]: 7DerivedILi4EE
vector[8500000]: 7DerivedILi3EE
vector[9000000]: 7DerivedILi5EE
vector[9500000]: 7DerivedILi5EE

Measuring sorted vector of 10 derived classes
Virtual calls: 10000000
Elapsed time (s): 0.115058

As you can see, with 10 derived classes the calls cost more than twice as much as with a single derived class, and nearly 3x in the sorted case. Branch prediction seems like the likely culprit, but somehow, on a vector sorted by type, the performance is even worse than on the random one!

EDIT2: There doesn't seem to be any black magic going on in the code generation. Here is the disassembly for measure_call:

(gdb) disassemble
Dump of assembler code for function _Z12measure_callISt6vectorISt10unique_ptrI4BaseSt14default_deleteIS2_EESaIS5_EEEvRKT_:
0x0000000100002170: push   r14
0x0000000100002172: push   r13
0x0000000100002174: push   r12
0x0000000100002176: mov    r12,rdi
0x0000000100002179: push   rbp
0x000000010000217a: push   rbx
0x000000010000217b: sub    rsp,0x10
0x000000010000217f: call   0x10000283e <dyld_stub__ZNSt6chrono3_V212system_clock3nowEv>
0x0000000100002184: mov    rbp,QWORD PTR [r12+0x8]
0x0000000100002189: mov    rbx,QWORD PTR [r12]
0x000000010000218d: mov    r13,rax
0x0000000100002190: cmp    rbx,rbp
0x0000000100002193: je     0x1000021b1 <_Z12measure_callISt6vectorISt10unique_ptrI4BaseSt14default_deleteIS2_EESaIS5_EEEvRKT_+65>
0x0000000100002195: nop    DWORD PTR [rax+rax+0x0]
0x000000010000219a: nop    WORD PTR [rax+rax+0x0]
0x00000001000021a0: mov    rdi,QWORD PTR [rbx]
0x00000001000021a3: add    rbx,0x8
0x00000001000021a7: mov    rcx,QWORD PTR [rdi]
0x00000001000021aa: call   QWORD PTR [rcx]
0x00000001000021ac: cmp    rbp,rbx
0x00000001000021af: jne    0x1000021a0 <_Z12measure_callISt6vectorISt10unique_ptrI4BaseSt14default_deleteIS2_EESaIS5_EEEvRKT_+48>
0x00000001000021b1: call   0x10000283e <dyld_stub__ZNSt6chrono3_V212system_clock3nowEv>
[iostream stuff follows]
user2363399
    What does `create_base` actually do? – Mats Petersson Aug 21 '13 at 20:10
  • It randomly allocates and returns a pointer to one of up to max_derived possible derived classes (definitions local to external library). – user2363399 Aug 21 '13 at 20:14
  • Look at the generated code. At a guess something is caching the result of the dynamic lookup - either the processor via branch prediction, or the generated code. (Your 0.09 worst case seems to translate to < 1ns penalty; do you really care?) – Alan Stokes Aug 21 '13 at 20:16
  • Right, so without seeing what is going on in `create_base`, I can only guess. Alan's suggestion sounds like one possibility - the processor simply finds the calls "less predictable". [And the difference between 0.036 and 0.093s is 0.057s - so about 0.6 ns per call - about one clock-cycle]. – Mats Petersson Aug 21 '13 at 20:21
  • On the printout, I was erroneously multiplying the number of calls by 10, so that would actually be 6 ns/call. – user2363399 Aug 21 '13 at 20:24
  • Post your **entire** code such that it is a self-contained test. Seeing is a tad more productive than guessing. – n. m. could be an AI Aug 21 '13 at 21:09
  • It might be that, if you have one derived class and never create an instance of the base directly, there's no virtual-table lookup for the call at all. The compilation-unit boundary may seem to rule that out, but IIRC some optimisers emit run-time code that tests whether an optimisation is applicable before choosing between the optimised and handles-all-cases versions (roughly the pattern sketched after these comments) - so long as the test can be done very efficiently, it may be a net gain in most cases. However, this does depend on a run-time test that's faster than calling through the virtual table or whatever the general-case code does. –  Aug 21 '13 at 21:34
  • Whether that applies or not, it's probably worth looking at the generated assembler. –  Aug 21 '13 at 21:35
  • I added the disassembly. It looks like a vanilla vtable call. – user2363399 Aug 21 '13 at 21:55
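For reference, that kind of speculative devirtualization looks roughly like this (a hand-written illustration of the pattern, not actual compiler output; Derived<0> is only visible inside virtual_impl.cpp, so this is purely a sketch):

void call_f(Base *p)
{
    // The optimiser guesses the most likely dynamic type and guards a direct,
    // inlinable call with a run-time check. Real compilers compare the vtable
    // pointer itself, which is cheaper than a typeid comparison.
    if (typeid(*p) == typeid(Derived<0>)) {
        static_cast<Derived<0> *>(p)->Derived<0>::f();  // devirtualized call
    } else {
        p->f();                                          // normal virtual dispatch
    }
}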

2 Answers


You don't show the implementation of create_base, so I'm only guessing, but I think you're seeing the effect of branch prediction.

With create_base(1), all the pointers point at objects of the same derived class, so all the calls go to the same derived virtual function. So after the first one, the branch predictor correctly predicts the target address of the indirect branch and there is no branch delay.

With create_base(10), you're creating pointers to 10 different derived classes, each with its own derived virtual function. So every call is to a different address, causing the branch target predictor to miss much (90%?) of the time.

If you try using create_base(2), I think you'll see that it's midway between the 1 and 10 cases (mispredicting half the time), and that larger values quickly approach the 10 case.
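One quick way to check this is to sweep max_derived. A sketch, replacing the main() in virtual.cpp (build_vector is a hypothetical helper; Timer, create_base and measure_call are the ones from the question):

static std::vector<std::unique_ptr<Base>> build_vector(int count, unsigned max_derived)
{
    // Fill a vector with `count` randomly chosen derived objects, drawing
    // from at most `max_derived` distinct types.
    std::vector<std::unique_ptr<Base>> v;
    v.reserve(count);
    for (int i = 0; i < count; ++i)
        v.emplace_back(create_base(max_derived));
    return v;
}

int main()
{
    const int NUM_VEC = 10000000;
    const unsigned counts[] = {1, 2, 3, 5, 10};

    // The timings should climb from the 1-class case toward the 10-class
    // case as the indirect-branch predictor misses more and more often.
    for (unsigned n : counts) {
        std::cout << "max_derived = " << n << '\n';
        measure_call(build_vector(NUM_VEC, n));
        std::cout << '\n';
    }
    return 0;
}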

Chris Dodd
  • I thought of this on reading some of the comments to the question, but contrary to what I expected from reading http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array sorting by type actually reduced performance. This doesn't make sense to me. – user2363399 Aug 21 '13 at 21:05
  • @user: Because in your case the sort cost more than being sorted saved. Performance improvements often depend on specific cases. – Billy ONeal Aug 21 '13 at 21:23
  • If you read the fragment, the time to sort is not included in the measurement. – user2363399 Aug 21 '13 at 21:30
1

The issue is this:

void f() { Base::val = N; }

and this:

for (auto &ptr : c) {
    ptr->f();
}

Sorting by type randomizes which memory locations you are reading from (to get the vtable pointer) and writing to (Base::val), and that effect will dwarf the cost of branch prediction for 10 vtables.
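One way to separate the two effects is to allocate the objects already grouped by type, so that both the call targets and the object memory are visited in long sequential runs. A sketch, assuming a hypothetical create_exact() factory is added to virtual_impl.cpp (and declared in virtual.h) that returns exactly Derived<n>:

// In virtual_impl.cpp (sketch): a factory that returns exactly Derived<n>,
// unlike create_base, which picks a type at random.
Base *create_exact(unsigned n)
{
    switch (n % NUM_DERIVED) {
    case 0: return new Derived<0>;
    case 1: return new Derived<1>;
    case 2: return new Derived<2>;
    case 3: return new Derived<3>;
    case 4: return new Derived<4>;
    case 5: return new Derived<5>;
    case 6: return new Derived<6>;
    case 7: return new Derived<7>;
    case 8: return new Derived<8>;
    default: return new Derived<9>;
    }
}

// In main() (sketch): build a vector that is grouped by type at allocation
// time, so no sort is needed and the objects stay in allocation order
// within each run.
base_vector vec3;
vec3.reserve(NUM_VEC);
for (int i = 0; i < NUM_VEC; ++i)
    vec3.emplace_back(create_exact(i / (NUM_VEC / NUM_DERIVED)));
measure_call(vec3);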

MSN
  • This is an interesting thought, but `this` needs to be dereferenced to access the vtable anyway. Commenting out the body of `f()` didn't change the relative ordering of the timings. Now 1 derived class is 0.023s, 10 random is 0.084s, and 10 sorted is 0.092s. – user2363399 Aug 21 '13 at 22:14
  • @user2363399, I updated my answer, but the fundamentals are the same. You still have to dereference `ptr` to get the vtable address, which when sorted is effectively randomizing reads. – MSN Aug 21 '13 at 22:22
  • This appears to be the correct answer. When I created a sorted vector by first creating a series of `Derived<0>` followed by a series of `Derived<1>`, etc., the runtime was identical to the single-derived-class case. The answer was a matter of branch prediction misses vs cache misses. – user2363399 Aug 22 '13 at 01:26
  • can someone explain why the memory locations get randomized? – user1767754 Mar 28 '14 at 21:51
  • @user1767754, calling `create_base(...)` repeatedly will very likely return memory that's close together. Sorting by vtable will randomize that memory (since the vtable itself is random) and make it non-contiguous. – MSN Mar 28 '14 at 22:40
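A simple way to see this is to print a few object addresses before and after the typeid sort (a sketch; dump_addresses is a hypothetical helper, not part of the original code):

#include <cstdio>

// Print the addresses of the first few objects: before the sort they are
// roughly in allocation order, after it they jump around the heap.
void dump_addresses(const std::vector<std::unique_ptr<Base>> &v, const char *label)
{
    std::printf("%s\n", label);
    for (int i = 0; i < 8 && i < (int) v.size(); ++i)
        std::printf("  element %d at %p\n", i, (void *) v[i].get());
}

// In main():
//   dump_addresses(vec2, "before sort");
//   std::sort(...);   // the typeid-based sort shown in the question
//   dump_addresses(vec2, "after sort");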