I am interested in the execution speed of a function called through a pointer. I found initially that calling a function pointer through a pointer passed in as a parameter is slower than calling a locally declared function pointer. Please see the following code; you can see I have two function calls, both of which ultimately execute a lambda through a function pointer.
#include <chrono>
#include <iostream>
using namespace std;
__attribute__((noinline)) int plus_one(int x) {
return x + 1;
}
typedef int (*FUNC)(int);
#define OUTPUT_TIME(msg) std::cout << "Execution time (ns) of " << msg << ": " << std::chrono::duration_cast<chrono::nanoseconds>(t_end - t_start).count() << std::endl;
#define START_TIMING() auto const t_start = std::chrono::high_resolution_clock::now();
#define END_TIMING(msg) auto const t_end = std::chrono::high_resolution_clock::now(); OUTPUT_TIME(msg);
auto constexpr g_count = 1000000;
__attribute__((noinline)) int speed_test_no_param() {
int r;
auto local_lambda = [](int a) {
return plus_one(a);
};
FUNC f = local_lambda;
START_TIMING();
for (auto i = 0; i < g_count; ++i)
r = f(100);
END_TIMING("speed_test_no_param");
return r;
}
__attribute__((noinline)) int speed_test_with_param(FUNC &f) {
int r;
START_TIMING();
for (auto i = 0; i < g_count; ++i)
r = f(100);
END_TIMING("speed_test_with_param");
return r;
}
int main() {
int ret = 0;
auto main_lambda = [](int a) {
return plus_one(a);
};
ret += speed_test_no_param();
FUNC fp = main_lambda;
ret += speed_test_with_param(fp);
return ret;
}
Built on Ubuntu 20.04 with:
g++ -ggdb -ffunction-sections -O3 -std=c++17 -DNDEBUG=1 -DRELEASE=1 -c speed_test.cpp -o speed_test.o && g++ -o speed_test -Wl,-gc-sections -Wl,--start-group speed_test.o -Wl,--rpath='$ORIGIN' -Wl,--end-group
The results were not surprising; for any given number of runs, we see that the version without the parameter is clearly the fastest. Here is just one run; all of the many times I have run, this yields the same result:
Execution time (ns) of speed_test_no_param: 74
Execution time (ns) of speed_test_with_param: 1173849
When I dig into the assembly, I found what I believe is the reason for this. The code for speed_test_no_param()
is:
0x000055555555534b call 0x555555555310 <plus_one(int)>
... whereas the code for speed_test_with_param
is more complicated; a fetch of the address of the lambda, then a jump to the plus_one
function:
0x000055555555544e call QWORD PTR [rbx]
...
0x0000555555555324 jmp 0x555555555310 <plus_one(int)>
(On compiler explorer at https://godbolt.org/z/b4hqYx7Eo. Different compiler but similar assembly; timing code commented out.)
What I didn't expect though is that when I reduce the number of calls down to 1 from 1000000 (auto constexpr g_count = 1
), the results are flipped with the parameter version being the fastest:
Execution time (ns) of speed_test_no_param: 61
Execution time (ns) of speed_test_with_param: 31
I have also run this many times; the parameter version is always the fastest.
I do not understand why this is; I don't now believe a call through a parameter is slower than a local variable due to this conflicting evidence, but looking at the assembly suggests it really should be.
Can someone please explain?
UPDATE
As per the comment below, ordering matters. When I call speed_test_with_param()
first, speed_test_no_param()
is the fastest of the two! Yet when I call speed_test_no_param()
first, speed_test_with_param()
is the fastest! Any explanation to this would be greatly appreciated!