Performance tests should (must) always be done with full optimization and without debug code generation.
Performance comparisons in debug mode are not trustworthy for multiple reasons (unless you want to measure or compare the performance of debug code itself), because the compiler generates extra code so that the executed machine code can be traced back to the source code. Two examples to illustrate this:
1. No Inlining in Debug Code
From cppreference about inline:
The original intent of the inline keyword was to serve as an indicator to the optimizer that inline substitution of a function is preferred over function call, that is, instead of executing the function call CPU instruction to transfer control to the function body, a copy of the function body is executed without generating the call. This avoids overhead created by the function call (passing the arguments and retrieving the result) but it may result in a larger executable as the code for the function has to be repeated multiple times.
Since this meaning of the keyword inline is non-binding, compilers are free to use inline substitution for any function that's not marked inline, and are free to generate function calls to any function marked inline. Those optimization choices do not change the rules regarding multiple definitions and shared statics listed above.
(I know for sure that VS2013 does not inline in debug mode and suspected the same for other compilers. Fiddling on godbolt showed me that g++ seems to behave the same way.)
To allow the debugger commands for step-in and step-out, it seems to be necessary to suppress inlining. Otherwise, it is probably hard to map machine code addresses to line numbers in the source code.
(Some years ago, I accidentally debugged C++ code (compiled with gcc) in which functions were inlined (probably due to wrong compiler settings). On single-step commands, the debugger (gdb) jumped over multiple nested inline function calls and did not stop before reaching the code of the innermost function. That was really painful.)
Example:
#include <iostream>

int add(int a, int b) { return a + b; }

inline int sub(int a, int b) { return a - b; }

int main()
{
  int a, b; std::cin >> a >> b;
  std::cout << add(a, b);
  std::cout << sub(a, b);
  return 0;
}
Excerpt from generated code for g++ -std=c++17 -g ...:
; 10: std::cout << add(a, b);
mov edx, DWORD PTR [rbp-8]
mov eax, DWORD PTR [rbp-4]
mov esi, edx
mov edi, eax
call add(int, int)
mov esi, eax
mov edi, OFFSET FLAT:_ZSt4cout
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
; 11: std::cout << sub(a, b);
mov edx, DWORD PTR [rbp-8]
mov eax, DWORD PTR [rbp-4]
mov esi, edx
mov edi, eax
call sub(int, int)
mov esi, eax
mov edi, OFFSET FLAT:_ZSt4cout
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
Even with little knowledge of ASM, the call add(int, int) and call sub(int, int) are easy to recognize.
Excerpt from generated code for g++ -std=c++17 -O2 ...:
; 10: std::cout << add(a, b);
mov esi, DWORD PTR [rsp+8]
mov edi, OFFSET FLAT:_ZSt4cout
add esi, DWORD PTR [rsp+12]
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
; 11: std::cout << sub(a, b);
mov esi, DWORD PTR [rsp+8]
mov edi, OFFSET FLAT:_ZSt4cout
sub esi, DWORD PTR [rsp+12]
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
Instead of calling the functions add() and sub(), the compiler just inlines both. The respective instructions are add esi, DWORD PTR [rsp+12] and sub esi, DWORD PTR [rsp+12].
Full Example on godbolt
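To compare a real function call against inlined code in an optimized build, inlining has to be suppressed explicitly for one of the two variants. The following is only a minimal sketch under assumptions: the macro NOINLINE and the functions addCalled(), addInlined() and sumBoth() are made up for illustration, and __attribute__((noinline)) (GCC, Clang) as well as __declspec(noinline) (MSVC) are compiler extensions, not standard C++.

```cpp
// NOINLINE is a hypothetical helper macro (not part of the example above).
#if defined(__GNUC__) || defined(__clang__)
#  define NOINLINE __attribute__((noinline))
#elif defined(_MSC_VER)
#  define NOINLINE __declspec(noinline)
#else
#  define NOINLINE /* no known way to suppress inlining on this compiler */
#endif

// Asked to stay a real function call, even with optimization enabled.
NOINLINE int addCalled(int a, int b) { return a + b; }

// Candidate for inline substitution; the compiler still decides on its own.
inline int addInlined(int a, int b) { return a + b; }

// Uses both variants so that each of them appears in the generated code.
int sumBoth(int a, int b) { return addCalled(a, b) + addInlined(a, b); }
```

Whether the call to addCalled() really survives should be checked in the generated code (e.g. on godbolt) for the compiler and options in use, as the optimizer may still transform such a function in other ways.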
2. assert()
It is very common (also in std library classes and functions) to enrich code with assert()s, which help to find implementation bugs in debug mode but are excluded in release mode to gain additional performance.
Possible implementation (from cppreference.com):
#ifdef NDEBUG
#define assert(condition) ((void)0)
#else
#define assert(condition) /*implementation defined*/
#endif
Hence, when code is compiled in debug mode, the performance measurement will include all these assert expressions as well. Usually, this is considered a distorted result.
Example:
#include <cassert>
#include <iostream>

int main()
{
  int n; std::cin >> n;
  assert(n > 0); // (stupid idea to assert user input)
  std::cout << n;
  return 0;
}
Excerpt from generated code for g++ -std=c++17 -D_DEBUG ...:
; 6: int n; std::cin >> n;
lea rax, [rbp-4]
mov rsi, rax
mov edi, OFFSET FLAT:_ZSt3cin
call std::basic_istream<char, std::char_traits<char> >::operator>>(int&)
; 7: assert(n > 0); // (stupid idea to assert user input)
mov eax, DWORD PTR [rbp-4]
test eax, eax
jg .L2
mov ecx, OFFSET FLAT:main::__PRETTY_FUNCTION__
mov edx, 7
mov esi, OFFSET FLAT:.LC0
mov edi, OFFSET FLAT:.LC1
call __assert_fail
.L2:
; 8: std::cout << n;
mov eax, DWORD PTR [rbp-4]
mov esi, eax
mov edi, OFFSET FLAT:_ZSt4cout
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
Excerpt from generated code for g++ -std=c++17 -DNDEBUG ...:
; 6: int n; std::cin >> n;
lea rax, [rbp-4]
mov rsi, rax
mov edi, OFFSET FLAT:_ZSt3cin
call std::basic_istream<char, std::char_traits<char> >::operator>>(int&)
; 8: std::cout << n;
mov eax, DWORD PTR [rbp-4]
mov esi, eax
mov edi, OFFSET FLAT:_ZSt4cout
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
Full Example on godbolt
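That the asserted expression is not even evaluated with NDEBUG can also be observed at runtime. A small sketch, assuming nothing beyond the standard assert() macro; the names g_evaluations, counted() and countAssertEvaluations() are invented for this illustration:

```cpp
#include <cassert>

// Hypothetical counter: how often an assert condition was really evaluated.
static int g_evaluations = 0;

// Passes the condition through and records that it has been evaluated.
bool counted(bool condition)
{
  ++g_evaluations;
  return condition;
}

// Executes n asserts and returns the number of evaluated conditions:
// n in a build without NDEBUG, 0 if compiled with -DNDEBUG.
int countAssertEvaluations(int n)
{
  g_evaluations = 0;
  for (int i = 0; i < n; ++i) {
    assert(counted(i >= 0));
  }
  return g_evaluations;
}
```

In a benchmark, every such evaluation is work that exists in the debug build only, which is why the two builds are not comparable.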
Trying to Reproduce the Claim of OP...
Disclaimer: I'm not a benchmarking expert. If you want another opinion, this might be interesting: CppCon 2015: Chandler Carruth "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!".
Nevertheless, I see some weaknesses in the tests of OP:
Both kinds of test should happen in the same program (so that they run under conditions as comparable as possible).
Before measuring CPU speed, a "warm-up" is recommended.
Modern compilers perform data-flow analysis. To make sure that the relevant code isn't optimized away, side effects have to be put in carefully. On the other hand, these side effects may distort the measurement.
Modern compilers are clever enough to compute as much as possible at compile time. To prevent this, I/O should be used to initialize the relevant parameters.
Measurements in general should be repeated to average out measurement errors. (This is something I learnt decades ago in physics lessons when I was a schoolboy.)
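The points about side effects and repeated measurement can be sketched generically as follows. This is an illustration only, not the program I actually ran: sumUpTo(), timeRepeated() and average() are hypothetical names, and std::chrono::steady_clock is used here instead of gettimeofday().

```cpp
#include <chrono>
#include <cstdint>
#include <vector>

// Example workload (hypothetical): sums the numbers 0 .. n-1.
long sumUpTo(long n)
{
  long sum = 0;
  for (long i = 0; i < n; ++i) sum += i;
  return sum;
}

// Runs work(n) `repetitions` times and returns each duration in microseconds.
template <typename Work>
std::vector<std::int64_t> timeRepeated(Work work, long n, int repetitions)
{
  using Clock = std::chrono::steady_clock; // monotonic clock
  volatile long sink = 0; // the store is a side effect the compiler must keep
  std::vector<std::int64_t> samples;
  for (int i = 0; i < repetitions; ++i) {
    const Clock::time_point start = Clock::now();
    sink = work(n);
    const Clock::time_point stop = Clock::now();
    samples.push_back(
      std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count());
  }
  return samples;
}

// Arithmetic mean of the samples (0 for an empty vector).
double average(const std::vector<std::int64_t> &samples)
{
  if (samples.empty()) return 0.0;
  double sum = 0.0;
  for (std::int64_t s : samples) sum += static_cast<double>(s);
  return sum / samples.size();
}
```

A call like timeRepeated(sumUpTo, n, 10) with n read from std::cin would cover the I/O point as well. Note that the volatile store only forces the result to be produced; the compiler may still compute it without running the loop.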
With these things in mind, I modified OP's sample code a bit:
#include <iostream>
#include <iomanip>
#include <vector>
#include <sys/time.h>

class Number {
  protected:
    long _val;
  public:
    Number(long n): _val(n) { }
    operator long() const;
    Number operator ++(int);
    Number& operator ++();
    bool operator < (long n) const;
};

Number :: operator long() const { return _val; }
Number Number :: operator ++(int) { return Number(_val++); }
Number& Number :: operator ++() { ++_val; return *this; }
bool Number :: operator < (long n) const { return _val < n; }

class NumberInline {
  protected:
    long _val;
  public:
    NumberInline(long n): _val(n) { }
    operator long() const { return _val; }
    NumberInline operator ++(int) { return NumberInline(_val++); }
    NumberInline& operator ++() { ++_val; return *this; }
    bool operator < (long n) const { return _val < n; }
};

long microsec(timeval &t) { return t.tv_sec * 1000000L + t.tv_usec; }

int main()
{
  timeval t1, t2, t3;
  // heat-up of CPU
  std::cout << "Heating up...\n";
  gettimeofday(&t1, nullptr);
  Number n(0), nMax(0);
  do {
    ++n; ++nMax;
    gettimeofday(&t2, nullptr);
  } while (microsec(t2) < microsec(t1) + 3000000L /* 3s */);
  // do experiment
  std::cout << "Starting experiment...\n";
  int nExp; if (!(std::cin >> nExp)) return 1;
  long max; if (!(std::cin >> max)) return 1;
  std::vector<std::pair<long, long>> t;
  for (int i = 0; i < nExp; ++i) {
    Number n = 0; NumberInline nI = 0;
    gettimeofday(&t1, nullptr);
    for (n = 0; n < max; ++n);
    gettimeofday(&t2, nullptr);
    for (nI = 0; nI < max; ++nI);
    gettimeofday(&t3, nullptr);
    std::cout << "n: " << n << ", nI: " << nI << '\n';
    t.push_back(std::make_pair(
      microsec(t2) - microsec(t1),
      microsec(t3) - microsec(t2)));
    std::cout << "t[" << i << "]: { "
      << std::setw(10) << t[i].first << "us, "
      << std::setw(10) << t[i].second << "us }\n";
  }
  double tAvg0 = 0.0, tAvg1 = 0.0;
  for (const std::pair<long, long> &tI : t) {
    tAvg0 += tI.first; tAvg1 += tI.second;
  }
  tAvg0 /= nExp; tAvg1 /= nExp;
  std::cout << "Average times: " << std::fixed
    << tAvg0 << "us, " << tAvg1 << "us\n";
  std::cout << "Ratio: " << tAvg0 / tAvg1 << "\n";
  return 0;
}
I compiled and tested this in cygwin64 on Windows 10 (64 bit):
$ g++ --version
g++ (GCC) 7.3.0
$ echo "10 1000000000" | (g++ -std=c++11 -O0 testInline.cc -o testInline && ./testInline)
Heating up...
Starting experiment...
n: 1000000000, nI: 1000000000
t[0]: { 4811515us, 4579710us }
n: 1000000000, nI: 1000000000
t[1]: { 4703022us, 4649293us }
n: 1000000000, nI: 1000000000
t[2]: { 4725413us, 4724408us }
n: 1000000000, nI: 1000000000
t[3]: { 4777736us, 4744561us }
n: 1000000000, nI: 1000000000
t[4]: { 4807298us, 4831872us }
n: 1000000000, nI: 1000000000
t[5]: { 4853159us, 4616783us }
n: 1000000000, nI: 1000000000
t[6]: { 4818285us, 4769500us }
n: 1000000000, nI: 1000000000
t[7]: { 4753801us, 4693287us }
n: 1000000000, nI: 1000000000
t[8]: { 4781828us, 4439588us }
n: 1000000000, nI: 1000000000
t[9]: { 4125942us, 4090368us }
Average times: 4715799.900000us, 4613937.000000us
Ratio: 1.022077
$ echo "10 1000000000" | (g++ -std=c++11 -O1 testInline.cc -o testInline && ./testInline)
Heating up...
Starting experiment...
n: 1000000000, nI: 1000000000
t[0]: { 395756us, 381372us }
n: 1000000000, nI: 1000000000
t[1]: { 410973us, 395130us }
n: 1000000000, nI: 1000000000
t[2]: { 383708us, 376009us }
n: 1000000000, nI: 1000000000
t[3]: { 399632us, 373718us }
n: 1000000000, nI: 1000000000
t[4]: { 362056us, 398840us }
n: 1000000000, nI: 1000000000
t[5]: { 370812us, 397596us }
n: 1000000000, nI: 1000000000
t[6]: { 381679us, 392219us }
n: 1000000000, nI: 1000000000
t[7]: { 371318us, 396928us }
n: 1000000000, nI: 1000000000
t[8]: { 404398us, 433730us }
n: 1000000000, nI: 1000000000
t[9]: { 370402us, 356458us }
Average times: 385073.400000us, 390200.000000us
Ratio: 0.986862
$ echo "10 1000000000" | (g++ -std=c++11 -O2 testInline.cc -o testInline && ./testInline)
Heating up...
Starting experiment...
n: 1000000000, nI: 1000000000
t[0]: { 1us, 0us }
n: 1000000000, nI: 1000000000
t[1]: { 0us, 0us }
n: 1000000000, nI: 1000000000
t[2]: { 0us, 0us }
n: 1000000000, nI: 1000000000
t[3]: { 0us, 0us }
n: 1000000000, nI: 1000000000
t[4]: { 0us, 0us }
n: 1000000000, nI: 1000000000
t[5]: { 0us, 0us }
n: 1000000000, nI: 1000000000
t[6]: { 0us, 0us }
n: 1000000000, nI: 1000000000
t[7]: { 0us, 0us }
n: 1000000000, nI: 1000000000
t[8]: { 0us, 0us }
n: 1000000000, nI: 1000000000
t[9]: { 0us, 1us }
Average times: 0.100000us, 0.100000us
Ratio: 1.000000
$
Concerning the last test (with -O2), I have serious doubts whether the code in question is executed at runtime at all. I loaded the file into godbolt (same g++ version, same options) to get a clue. I had serious trouble interpreting the generated code, but here it is: Sample in Compiler Explorer.
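If the loops are indeed removed or replaced by a precomputed result, one commonly used countermeasure is an optimization barrier inside the loop body. The following is a sketch under assumptions, not something I measured: keepAlive() and countTo() are hypothetical names, and the empty inline assembly is a GCC/Clang extension (other compilers get a volatile store as fallback).

```cpp
// keepAlive() is a hypothetical helper, not part of the test program above.
inline void keepAlive(long &value)
{
#if defined(__GNUC__) || defined(__clang__)
  // Empty inline assembly which claims to read and to modify `value`,
  // so the compiler cannot predict the value across this point.
  __asm__ __volatile__("" : "+r"(value) : : "memory");
#else
  // Fallback: a volatile store is a side effect the compiler must keep.
  volatile long sink = value;
  (void)sink;
#endif
}

// Counts from 0 up to max; the barrier keeps the loop from being collapsed.
long countTo(long max)
{
  long n = 0;
  while (n < max) {
    ++n;
    keepAlive(n);
  }
  return n;
}
```

The barrier is not free of charge. It may block other optimizations as well, so it changes what is being measured and should be applied to both variants in the same way.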
Finally, the ratios of the first two experiments (close to 1) tell me:
The differences are very probably mere measurement noise.
I cannot reproduce OP's statement that inlined code runs in significantly different time than non-inlined code.
(That is in addition to the serious doubts which have already been raised in the comments and in FalcoGer's answer.)
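The "noise" argument can be made a bit more tangible by comparing the mean of the per-run differences with their spread. A rough sketch, not a rigorous statistical test: the timings are copied from the -O1 run above, while timingsO1(), Stats and differenceStats() are names invented for this illustration.

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Timings (non-inline, inline) in microseconds, copied from the -O1 run above.
std::vector<std::pair<long, long>> timingsO1()
{
  return {
    { 395756, 381372 }, { 410973, 395130 }, { 383708, 376009 },
    { 399632, 373718 }, { 362056, 398840 }, { 370812, 397596 },
    { 381679, 392219 }, { 371318, 396928 }, { 404398, 433730 },
    { 370402, 356458 }
  };
}

struct Stats { double mean; double stddev; };

// Mean and sample standard deviation of the differences first - second.
Stats differenceStats(const std::vector<std::pair<long, long>> &t)
{
  Stats s = { 0.0, 0.0 };
  if (t.empty()) return s;
  for (const std::pair<long, long> &p : t) {
    s.mean += static_cast<double>(p.first - p.second);
  }
  s.mean /= t.size();
  if (t.size() < 2) return s; // spread is undefined for a single sample
  double sumSq = 0.0;
  for (const std::pair<long, long> &p : t) {
    const double d = static_cast<double>(p.first - p.second) - s.mean;
    sumSq += d * d;
  }
  s.stddev = std::sqrt(sumSq / (t.size() - 1));
  return s;
}
```

If the absolute value of the mean returned by differenceStats(timingsO1()) is small compared to the standard deviation, the data gives no reason to assume a systematic difference between both variants.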