I wanted to measure and compare the overhead of different function calls. Different in the sense of two alternative ways to deal with extending the class while minimizing code modification:
- using an abstract base class and providing implementations in virtual member functions
- using a policy host class and defining different policies with static and member functions
Both those options are compared with calling no function at all. I am also aware of the NVI idiom usually used when designing classes that support dynamic polymorphism - the example that I used was just a benchmark for the overhead.
Here is the code that I tried to use for this purpose:
#include <iostream>
#include <vector>
#include <chrono>
#include <ctime>
#include <memory>
class Interface
{
public:
virtual double calculate(double t) = 0;
virtual ~Interface() = default;
};
class Square
:
public Interface
{
public:
double calculate(double d)
{
return d*d;
}
};
class SquareStaticFunction
{
public:
static double calculate(double d)
{
return d*d;
}
};
class SquareMemberFunction
{
public:
double calculate(double d)
{
return d*d;
}
};
template<typename Function>
class Generic
:
public Function
{
public:
using Function::calculate;
};
using namespace std;
int main(int argc, const char *argv[])
{
vector<double> test(1e06, 5);
unique_ptr<Interface> sUptr(new Square());
Interface* sPtr = new Square();
Generic<SquareStaticFunction> gStatic;
Generic<SquareMemberFunction> gMember;
double result;
typedef std::chrono::high_resolution_clock Clock;
auto start = Clock::now();
for (auto d : test)
{
result = d * d;
}
auto end = Clock::now();
auto noFunction = end - start;
start = Clock::now();
for (auto d : test)
{
result = sUptr->calculate(d);
}
end = Clock::now();
auto virtualMemberFunction = end - start;
start = Clock::now();
for (auto d : test)
{
result = sPtr->calculate(d);
}
end = Clock::now();
auto virtualMemberFunctionRaw = end - start;
start = Clock::now();
for (auto d : test)
{
result = gStatic.calculate(d);
}
end = Clock::now();
auto staticPolicy = end - start;
start = Clock::now();
for (auto d : test)
{
result = gMember.calculate(d);
}
end = Clock::now();
auto memberPolicy = end - start;
cout << noFunction.count() << " " << virtualMemberFunction.count()
<< " " << virtualMemberFunctionRaw.count()
<< " " << staticPolicy.count()
<< " " << memberPolicy.count() << endl;
delete sPtr;
sPtr = nullptr;
return 0;
}
I compiled the code using gcc 4.8.2, and on a Linux x86_64 machine, with the following CPU model: Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz.
The virtual member function is accessed in one test via a raw pointer, and in another one via a unique_ptr
. First I have compiled the code without any optimizations:
g++ -std=c++11 main.cpp -o main
and ran 1000 tests with the following shell command:
for i in {1..1000}; do ./main >> results; done
The results file I have plotted using the following gnuplot
script (note the logarithmic y-axis):
set terminal png size 1600,800
set logscale y
set key out vert right top
set out 'results.png'
plot 'results' using 0:1 title "no function" , \
'results' using 0:2 title "virtual member function (unique ptr)", \
'results' using 0:3 title "virtual member function (raw ptr)", \
'results' using 0:4 title "static policy", \
'results' using 0:5 title 'member function policy'
For the non-optimized code, the diagram looks like this:
Q1 Does the call to the virtual function via the unique_ptr
end up being the most expensive one because it involves a redirection when dereferencing the pointer to the managed object?
Then I turned on optimization and compiled the code with:
g++ -std=c++11 -O3 main.cpp -o main
which resulted in the following diagram:
Q2: Are the virtual member functions most costly in this case because, when accessed via the base class pointer or reference (the vtable dispatch is turned on), it is not possible for the compiler to make them inline?
Q3: This question is what made me post all this: how is is possible in the optimized diagram that the static and member policies end up being faster than rolled out code for this simple example?
Edit: making the result
volatile
and compiling with optimizations turned on, makes the run times of the policies much larger, but they are similar to the raw multiplication code:
Edit modifying the code so that the result is added to instead of assigned (proposed by dyk in comments) without using volatile
:
result += ...
results with the same diagram as for the original code.