
I wanted to measure and compare the overhead of different kinds of function calls; "different" here means two alternative ways of extending a class while minimizing code modification:

  • using an abstract base class and providing implementations in virtual member functions
  • using a policy host class and defining different policies with static and member functions

Both options are compared against calling no function at all. I am also aware of the NVI idiom usually used when designing classes that support dynamic polymorphism; the example I used is just a benchmark for the call overhead.
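For context only, a minimal sketch of the NVI idiom (the class names here are illustrative and not part of the benchmark):

// Non-Virtual Interface: the public entry point is non-virtual and forwards to a
// private virtual hook; derived classes override only the hook.
class ShapeNVI
{
    public:
        double area() const { return doArea(); }     // stable public interface
        virtual ~ShapeNVI() = default;

    private:
        virtual double doArea() const = 0;            // customization point
};

class CircleNVI : public ShapeNVI
{
    public:
        explicit CircleNVI(double r) : radius_(r) {}

    private:
        double doArea() const override { return 3.14159265358979 * radius_ * radius_; }
        double radius_;
};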

Here is the code that I tried to use for this purpose:

#include <iostream>
#include <vector>
#include <chrono>
#include <ctime>
#include <memory>

class Interface 
{
    public:
        virtual double calculate(double t) = 0; 
        virtual ~Interface() = default;

};

class Square
: 
    public Interface
{
    public:

       double calculate(double d)
       {
           return d*d;
       }

};

class SquareStaticFunction
{
    public:
        static double calculate(double d)
        {
            return d*d; 
        }
};

class SquareMemberFunction
{
    public:
        double calculate(double d)
        {
            return d*d; 
        }
};

// Policy host class: inherits calculate() from the policy it is instantiated with.
template<typename Function>
class Generic
:
    public Function
{
    public:
        using Function::calculate;
};

using namespace std;

int main(int argc, const char *argv[])
{
    vector<double> test(1e06, 5);    // one million elements, all equal to 5.0

    unique_ptr<Interface> sUptr(new Square());

    Interface* sPtr = new Square(); 

    Generic<SquareStaticFunction> gStatic; 
    Generic<SquareMemberFunction> gMember; 

    double result;

    typedef std::chrono::high_resolution_clock Clock; 

    auto start = Clock::now();
    for (auto d : test)
    {
        result = d * d;  
    }
    auto end = Clock::now(); 

    auto noFunction = end - start; 

    start = Clock::now();  

    for (auto d : test)
    {
        result = sUptr->calculate(d);
    }
    end = Clock::now();  

    auto virtualMemberFunction = end - start; 

    start = Clock::now();  

    for (auto d : test)
    {
        result = sPtr->calculate(d);
    }
    end = Clock::now();  

    auto virtualMemberFunctionRaw = end - start; 

    start = Clock::now();
    for (auto d : test)
    {
        result = gStatic.calculate(d);  
    }
    end = Clock::now(); 

    auto staticPolicy = end - start; 

    start = Clock::now();
    for (auto d : test)
    {
        result = gMember.calculate(d);  
    }
    end = Clock::now(); 

    auto memberPolicy = end - start; 

    cout << noFunction.count() << " " << virtualMemberFunction.count() 
        << " " << virtualMemberFunctionRaw.count() 
        << " " << staticPolicy.count() 
        << " " << memberPolicy.count() << endl;

    delete sPtr; 
    sPtr = nullptr;

    return 0;
}

I compiled the code using gcc 4.8.2 on a Linux x86_64 machine with the following CPU model: Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz.

The virtual member function is accessed via a raw pointer in one test and via a unique_ptr in another. First I compiled the code without any optimizations:

g++ -std=c++11 main.cpp -o main

and ran 1000 tests with the following shell command:

for i in {1..1000}; do ./main >> results; done

I plotted the results file using the following gnuplot script (note the logarithmic y-axis):

set terminal png size 1600,800
set logscale y 
set key out vert right top
set out 'results.png' 
plot 'results' using 0:1 title "no function" , \
     'results' using 0:2 title "virtual member function (unique ptr)", \
     'results' using 0:3 title "virtual member function (raw ptr)", \
     'results' using 0:4 title "static policy", \
     'results' using 0:5 title 'member function policy'

For the non-optimized code, the diagram looks like this:

Non-optimized function call overhead.

Q1: Does the call to the virtual function via the unique_ptr end up being the most expensive one because it involves an extra indirection when dereferencing the pointer to the managed object?

Then I turned on optimization and compiled the code with:

g++ -std=c++11 -O3 main.cpp -o main

which resulted in the following diagram:

Optimized function call overhead.

Q2: Are the virtual member functions the most costly in this case because, when they are accessed via a base-class pointer or reference (so vtable dispatch is involved), the compiler cannot inline them?

Q3: This question is what made me post all this: how is it possible that, in the optimized diagram, the static and member policies end up being faster than the rolled-out multiplication code for this simple example?

Edit: making result volatile and compiling with optimizations turned on makes the run times of the policies much larger, but they become similar to those of the raw multiplication code:

Function call overhead with a volatile result (optimized build).

Edit: modifying the code so that result is added to instead of assigned (proposed by dyp in the comments), without using volatile:

result += ...

results in the same diagram as for the original code.
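For reference, the two modified inner loops look roughly like this (only the first timing loop is shown, the same change was applied to the others; the variable names here are just for illustration):

// Edit 1: volatile forces the compiler to keep every store to the result.
volatile double resultVolatile = 0;
for (auto d : test)
{
    resultVolatile = d * d;
}

// Edit 2: accumulate instead of overwrite (no volatile), as suggested by dyp.
double resultSum = 0;
for (auto d : test)
{
    resultSum += d * d;
}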

  • Have you tried to run the tests in a different order? – dyp Mar 10 '14 at 13:51
  • @dyp: when I place the test with rolled out multiplication anywhere else but at the beginning, the measured times overlap more/less with the policy class code. do you know why this happens? – tmaric Mar 10 '14 at 14:01
  • 1
    I'm not entirely sure, but I guessed that would happen :) Maybe it's related to caching. – dyp Mar 10 '14 at 14:02
  • 1
    Also, you could try `-march=native` and `-fwhole-program` and output the `result` somewhere; this should allow more aggressive optimizations. – dyp Mar 10 '14 at 14:06
  • Did you do anything with `result` after playing around with it? Dead code optimization is relatively easy. Dead code whose side effects include print statements or return values, less so. – Yakk - Adam Nevraumont Mar 10 '14 at 18:41
  • Unless your program is calling millions of microscopic methods, this can't possibly be a performance bottleneck. Don't FUBAR the code to gain 1% of performance. – vonbrand Mar 18 '14 at 00:36
  • @vonbrand, yep, my code is calling hundreds of millions of microscopic methods. :) – tmaric Mar 18 '14 at 14:43
  • @tmaric, very interesting, can you post the graph for the last revision please? – Surt Oct 08 '14 at 10:37

1 Answer


Looking at the disassembly of -O3 -march=native -std=c++11 with your code showed that the compiler was doing "too much" optimization by detecting the unnecessary re-assignment to the same unused variable. As suggested in the comments, I used += instead of =. I also initialized result = 0 and made main return result instead of 0, to make sure the compiler actually computes its value (a sketch of this modification follows the list below). This modified code gives:

  • noFunction, staticPolicy and memberPolicy are lowered to mulsd, addsd, addsd, i.e. scalar SSE instructions. Clang also doesn't vectorize (with vanilla options), but Intel's icc does (it generates vector and non-vector versions and branches depending on alignment and iteration count).
  • virtualMemberFunction and virtualMemberFunctionRaw result in a dynamic function call (no de-virtualization, no inlining).
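The modification itself, sketched (only the changed lines; the timing code stays as in the question):

double result = 0;                 // initialized so the final value is well-defined

for (auto d : test)
{
    result += d * d;               // accumulate instead of overwriting
}
// ... the same += change in the four other timing loops ...

return result;                     // at the end of main; implicitly converted to int,
                                   // but it keeps the compiler from discarding result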

You can see this for yourself by pasting your code into an online disassembly tool.
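If you would rather inspect the assembly locally, something along these lines works with gcc (same flags as above; the exact output differs between compiler versions):

g++ -std=c++11 -O3 -march=native -S -o main.s main.cpp   # emit assembly instead of an executable
grep -nE 'mulsd|addsd|call' main.s                        # spot the scalar SSE ops and the remaining calls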

To answer your Q1 ("pointer vs unique_ptr in a debug build"): at -O0, calls are not inlined automatically; in particular, unique_ptr::operator-> is called explicitly with no inlining, so that's 2 function calls per iteration instead of 1 for a regular pointer. This difference disappears in optimized builds.
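To make that concrete, at -O0 the smart-pointer call behaves roughly like this (a conceptual sketch, not the actual library code):

// sUptr->calculate(d) at -O0: two non-inlined calls per iteration instead of one.
Interface* raw = sUptr.operator->();   // call #1: std::unique_ptr<Interface>::operator->
result = raw->calculate(d);            // call #2: the virtual dispatch through the vtable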

To answer your Q2 ("is it possible to inline virtual calls"): in this example, gcc and clang don't inline the call, probably because they don't do enough static analysis. But you can help them. For instance, with clang 3.3 (but not 3.2, and not gcc), declaring the method as const and __attribute__((pure)) does the job. In gcc (4.8, pre-4.9) I tried marking the method as final and compiling with -fwhole-program, but this didn't remove the call. So yes, in this specific case it is possible to de-virtualize, but not reliably. In general, jitted compilers (C#, Java) are better at de-virtualizing because they can make better assumptions from runtime information.
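The hints mentioned above look roughly like this; whether they actually enable inlining depends on the compiler version as described. Note that the base-class declaration has to be changed to const as well, otherwise the derived function would no longer override it:

class Interface
{
    public:
        __attribute__((pure)) virtual double calculate(double d) const = 0;
        virtual ~Interface() = default;
};

class Square : public Interface
{
    public:
        // const + pure: reportedly enough for clang 3.3 to de-virtualize and inline.
        // final (plus -fwhole-program): the hint tried with gcc 4.8, without success.
        __attribute__((pure)) double calculate(double d) const final
        {
            return d*d;
        }
};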

  • *"Jitted compilers (C#, Java) are better at optimizing this specific case."* Please try `-fwhole-program` passing also, and using `final` before jumping to this conclusion. – Ali Mar 10 '14 at 16:20
  • @Ali: by this specific case I meant *virtual calls*, not only the case where a method has a single implementation, which is a very specific situation. In general, jitters have access to dynamic+static info, so in theory they can make more accurate assumptions than static compilers. In practice it depends on your compiler implementation ;) – Antoine Mar 10 '14 at 16:31
  • @Ali: I tested with `final` and `whole-program` and edited my answer (no inlining, but **clang 3.3 inlines with `pure`**), thanks for your feedback. – Antoine Mar 10 '14 at 17:14