I am benchmarking some example functions on my processor, with each core running at 2 GHz. The functions being benchmarked are shown below; they are also available on quick-bench.
#include <benchmark/benchmark.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <memory>
#include <string>
class Base
{
public:
    virtual int addNumVirt( int x ) { return (i + x); }
    int addNum( int x ) { return (x + i); }
    virtual ~Base() {}
private:
    uint32_t i{10};
};
class Derived : public Base
{
public:
    // Overrides of virtual functions are always virtual
    int addNumVirt( int x ) override { return (x + i); }
    // Not virtual: hides Base::addNum and is bound at compile time
    int addNum( int x ) { return (x + i); }
private:
    uint32_t i{20};
};
static void BM_nonVirtualFunc(benchmark::State &state)
{
    srand(time(0));
    volatile int x = rand();
    std::unique_ptr<Derived> derived = std::make_unique<Derived>();
    for (auto _ : state)
    {
        auto result = derived->addNum( x );
        benchmark::DoNotOptimize(result);
    }
}
BENCHMARK(BM_nonVirtualFunc);
static void BM_virtualFunc(benchmark::State &state)
{
    srand(time(0));
    volatile int x = rand();
    std::unique_ptr<Base> derived = std::make_unique<Derived>();
    for (auto _ : state)
    {
        auto result = derived->addNumVirt( x );
        benchmark::DoNotOptimize(result);
    }
}
BENCHMARK(BM_virtualFunc);
static void StringCreation(benchmark::State& state) {
    // Code inside this loop is measured repeatedly
    for (auto _ : state) {
        std::string created_string("hello");
        // Make sure the variable is not optimized away by compiler
        benchmark::DoNotOptimize(created_string);
    }
}
// Register the function as a benchmark
BENCHMARK(StringCreation);
static void StringCopy(benchmark::State& state) {
    // Code before the loop is not measured
    std::string x = "hello";
    for (auto _ : state) {
        std::string copy(x);
    }
}
BENCHMARK(StringCopy);
Below are the Google Benchmark results.
Run on (64 X 2000 MHz CPU s)
CPU Caches:
  L1 Data 32K (x32)
  L1 Instruction 64K (x32)
  L2 Unified 512K (x32)
  L3 Unified 8192K (x8)
Load Average: 0.08, 0.04, 0.00
------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
BM_nonVirtualFunc      0.490 ns        0.490 ns   1000000000
BM_virtualFunc         0.858 ns        0.858 ns    825026009
StringCreation          2.74 ns         2.74 ns    253578500
BM_StringCopy           5.24 ns         5.24 ns    132874574
The results show that the execution times are 0.490 ns and 0.858 ns for the first two functions.
However, what I do not understand is this: if my core is running at 2 GHz, one cycle takes 0.5 ns, so a function call that finishes in less than one cycle seems unreasonable.
I know that the reported time is an average over the number of iterations. Such a low average means that most of the samples are below 0.5 ns, that is, shorter than a single clock cycle.
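To make the arithmetic explicit, here is a minimal sketch of the conversion between clock frequency, cycle time, and average cycles per iteration. It assumes the core really runs at the 2000 MHz that the benchmark header reports (frequency scaling or boost could change this), and the helper names `cycle_ns` and `cycles_per_iteration` are hypothetical, made up for this illustration.

```cpp
#include <cassert>
#include <cmath>

// Duration of one clock cycle in nanoseconds for a clock given in GHz.
// Assumption: the core runs at exactly the stated frequency.
constexpr double cycle_ns(double clock_ghz)
{
    return 1.0 / clock_ghz;
}

// Average number of clock cycles per benchmark iteration,
// given the measured time per iteration in nanoseconds.
constexpr double cycles_per_iteration(double time_ns, double clock_ghz)
{
    return time_ns * clock_ghz;
}
```

With the numbers above, `cycle_ns(2.0)` is 0.5 ns, and the two measured times work out to roughly 0.98 and 1.72 cycles per iteration on average.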
What am I missing?
Edit 1:
From the comments, it seems that adding a constant i to x was not a good idea. In fact, I started by calling std::cout in the virtual and non-virtual functions. This helped me understand that a virtual call made through a base pointer is generally not inlined, because the call has to be resolved at run time.
However, printing from the functions being benchmarked clutters the terminal. Can anyone propose an alternative to printing something inside the function? (Also, is there a way to share my code from Godbolt?)
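For illustration, one possible alternative is sketched below, assuming that any observable side effect other than output would do: each function increments a counter that can be inspected after the run. The names `CallCounts`, `CountingBase`, `CountingDerived`, and `call_through_base` are hypothetical and not part of my benchmark, and I have not measured how the extra increment affects the timings.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical counters that record which implementation ran, without printing.
struct CallCounts
{
    std::uint64_t base_virtual = 0;
    std::uint64_t derived_virtual = 0;
};

class CountingBase
{
public:
    explicit CountingBase(CallCounts &counts) : counts_(counts) {}
    virtual ~CountingBase() = default;
    virtual int addNumVirt(int x) { ++counts_.base_virtual; return x + 10; }
protected:
    CallCounts &counts_;
};

class CountingDerived : public CountingBase
{
public:
    using CountingBase::CountingBase; // inherit the constructor
    int addNumVirt(int x) override { ++counts_.derived_virtual; return x + 20; }
};

// Calls addNumVirt through a base pointer; the counters show which body ran.
inline int call_through_base(CallCounts &counts, int x)
{
    std::unique_ptr<CountingBase> obj = std::make_unique<CountingDerived>(counts);
    return obj->addNumVirt(x);
}
```

After calling `call_through_base(counts, 1)`, only `counts.derived_virtual` is incremented, which shows that the override was selected even though the static type is the base class. In a snippet this small the compiler may be able to devirtualize the call, so the counters show which function ran, not how the call was dispatched.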