
I have written a simple class and defined the bodies of its member functions outside of the class definition. The function bodies are very small (about a line each). When I measure the performance, it drops when I mark the function definitions as inline.

These are my class and member function definitions.

#include <iostream>
#include <sys/time.h>

class number {
protected:
        long _val;
public:
        number(long n): _val(n) {}

        operator long() const;

        number operator ++(int);
        number& operator ++();

        bool operator < (long n);
};

number::operator long() const { return _val; }
number number::operator ++(int) { return number(_val++); }
number& number::operator ++() { _val++; return *this; }
bool number::operator < (long n) { return _val < n; }

#define microsec(t) ((t).tv_sec * 1000000L + (t).tv_usec)

int main() {
        struct timeval t1, t2;

        gettimeofday(&t1, NULL);
        for(number n = 0; n < 999999999L; ++n);
        gettimeofday(&t2, NULL);

        std::cout << (microsec(t2) - microsec(t1)) << std::endl;
}

When I run the above code, it takes approximately 3.3 seconds to complete.

When I add inline before the member function definitions, it takes around 4.6 seconds.

I can understand that inline may hurt performance if the function body is big. But in my case the bodies are very small, so it should have produced either better or at least the same performance. Instead, the execution time increases with inline.

Could someone please help me in understanding this behaviour?

[EDIT 1] The question is not really about optimisation but about understanding inline better. I understand that it is up to the compiler to decide whether or not to honour the inline keyword, and that it optimises the code as it sees fit. However, my question is why it has an adverse effect on performance (without explicit optimisation). I am able to reproduce this behaviour consistently.

As suggested by a few in this thread, I used https://gcc.godbolt.org to inspect the ASM the compiler generated with and without inline. The ASM generated for main was identical in both cases. The only difference was that, with the inline keyword, no ASM was generated for the unused methods. Since the effective code was identical in both cases, it should not have produced any difference in runtime.

Just in case it matters, I am using gettimeofday(&_t2, NULL); to get the current system time and compute the time difference.

I am using g++ compiler, with -std=c++11 standard option. I am not using optimisation flags, as that is not the focus of my question.

[EDIT 2] Modified the code snippet to include the complete code to reproduce the problem.

sarva
  • What is your compilation command? Does it include optimizations? – chris Apr 02 '19 at 03:21
  • Cache is a funny thing. But when in doubt, assembly output never lies. I'd look at that to see what is really happening when you inline... – Michael Dorgan Apr 02 '19 at 03:38
  • When compiled with optimization they generate identical assembly code and almost identical runtimes for me. – Retired Ninja Apr 02 '19 at 04:25
  • You didn't mention your compiler. I don't know about g++ and clang but in VS2013 with Debug mode, there are no inlines. No inlined member functions (which are implemented in headers), no inlined functions which are explicitly marked `inline` (the compiler has ever been able to just ignore this). The reason is that step-in/step-out is hard to realize with inlined functions. In Release mode, the compiler may decide on its own for any function which function is worth to be inlined at which call. – Scheff's Cat Apr 02 '19 at 05:30
  • I am using g++ compiler. I am not using any optimisation options. The only option I use is -std=c++11. I am planning to use c++11, because, I am intending to use lambda functions. `g++ -std=c++11 -o tst tst.cpp` – sarva Apr 02 '19 at 05:47
  • If you want to measure performance, you should activate optimization (`-O2` or `-O3`) and prevent debug code (not `-g`). I'm just about to write an answer which illustrates this. (Btw. I just fiddled with g++ on godbolt. What I said about VS2013 seems to be true in `g++` as well.) – Scheff's Cat Apr 02 '19 at 05:51
  • Check your generated assembly on godbolt.org – Sam Apr 02 '19 at 06:25
  • You must also be very careful with benchmarks in general. If I read the gcc manual correctly, it does not apply any optimizations without a -O option (and that may mean that it does not inline at all!); but if it optimized, you must make sure that your loop has some side effect or it might be elided completely. – Peter - Reinstate Monica Apr 02 '19 at 06:56
  • Optimization questions where you've compiled with optimizations disabled are not useful. There's nothing interesting that can be learned from this, aside from the nuances of the specific compiler's code-generator. And you didn't even mention which compiler you're using, so that can't even be the topic of an authoritative answer. By the way, @Peter, your comment is *mostly* right, but GCC doesn't actually have a mode where it applies no optimizations. It is rather unique in that sense. Even with `-O0` (default), you still get certain optimizations. You just don't get the *additional* ones. – Cody Gray - on strike Apr 02 '19 at 06:59
  • @MichaelDorgan ASM lies often enough. In fact you can't know what the cpu does with it at the end of the day with all the fancy code reordering, caching, etc – FalcoGer Apr 02 '19 at 07:04
  • @CodyGray Interesting. The manual is a bit vague there ("reduce compilation time and make debugging produce the expected results"). In any case it should be safe to assume that it does not inline so that stack traces show the function. – Peter - Reinstate Monica Apr 02 '19 at 07:05
  • So why the run time diff (in case it's not a measuring artifact)? One reason which can always pop up without optimization is code alignment which can apparently have a large effect (and would be immediately optimized from -O1 on). Therefore irrelevant code changes (e.g. a simple re-ordering of independent statements) can have measurable performance impacts with -O0 which are simple random code layout artifacts. – Peter - Reinstate Monica Apr 02 '19 at 07:08

2 Answers


Compilers with optimization enabled know best when to inline functions. In fact, declaring a function inline may not inline it, and omitting the keyword may still inline it; there is little reason to use the keyword for performance.
I don't know the exact cause, but inlining a loop like that shouldn't change the execution time by 1.3 seconds: it only removes the cost of the function call, not the cost of the code inside the function.

More on inline can be found here.

FalcoGer

Performance tests should (must) always be done with full optimization enabled and without debug code generation.

Doing performance comparisons in debug mode isn't trustworthy for multiple reasons (unless you actually want to measure/compare the performance of debug code), because the compiler generates extra code to ensure that executed machine code can be traced back to the source. Two examples to illustrate this:

1. No Inlining in Debug Code

From cppreference about inline

The original intent of the inline keyword was to serve as an indicator to the optimizer that inline substitution of a function is preferred over function call, that is, instead of executing the function call CPU instruction to transfer control to the function body, a copy of the function body is executed without generating the call. This avoids overhead created by the function call (passing the arguments and retrieving the result) but it may result in a larger executable as the code for the function has to be repeated multiple times.

Since this meaning of the keyword inline is non-binding, compilers are free to use inline substitution for any function that's not marked inline, and are free to generate function calls to any function marked inline. Those optimization choices do not change the rules regarding multiple definitions and shared statics listed above.

(I know this for sure in VS2013 and suspected the same for other compilers. Fiddling on godbolt told me that g++ behaves the same way.)

To support the debugger's step-in and step-out commands, it seems necessary to suppress inlining; otherwise it would be hard to map machine-code addresses to source line numbers.

(Some years ago, I accidentally debugged C++ code (compiled with gcc) where functions were inlined (probably due to wrong compiler settings). In single-step mode, the debugger (gdb) jumped over multiple nested inline function calls without stopping before it reached the code of the innermost function. That was really painful.)

Example:

#include <iostream>

int add(int a, int b) { return a + b; }

inline int sub(int a, int b) { return a - b; }

int main()
{
  int a, b; std::cin >> a >> b;
  std::cout << add(a, b);
  std::cout << sub(a, b);
  return 0;
}

Excerpt from generated code for g++ -std=c++17 -g...:

; 10: std::cout << add(a, b);
        mov     edx, DWORD PTR [rbp-8]
        mov     eax, DWORD PTR [rbp-4]
        mov     esi, edx
        mov     edi, eax
        call    add(int, int)
        mov     esi, eax
        mov     edi, OFFSET FLAT:_ZSt4cout
        call    std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
; 11: std::cout << sub(a, b);
        mov     edx, DWORD PTR [rbp-8]
        mov     eax, DWORD PTR [rbp-4]
        mov     esi, edx
        mov     edi, eax
        call    sub(int, int)
        mov     esi, eax
        mov     edi, OFFSET FLAT:_ZSt4cout
        call    std::basic_ostream<char, std::char_traits<char> >::operator<<(int)

Even with little knowledge of ASM, the call add(int, int) and call sub(int, int) instructions are easy to recognize.

Excerpt from generated code for g++ -std=c++17 -O2...:

; 10: std::cout << add(a, b);
        mov     esi, DWORD PTR [rsp+8]
        mov     edi, OFFSET FLAT:_ZSt4cout
        add     esi, DWORD PTR [rsp+12]
        call    std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
; 11: std::cout << sub(a, b);
        mov     esi, DWORD PTR [rsp+8]
        mov     edi, OFFSET FLAT:_ZSt4cout
        sub     esi, DWORD PTR [rsp+12]
        call    std::basic_ostream<char, std::char_traits<char> >::operator<<(int)

Instead of calling the functions add() and sub(), the compiler just inlines both. The respective instructions are add esi, DWORD PTR [rsp+12] and sub esi, DWORD PTR [rsp+12].

Full Example on godbolt

2. assert()

It is very common (also in std library classes and functions) to enrich code with assert()s, which help to find implementation bugs in debug mode but are excluded in release mode to gain additional performance.

Possible implementation (from cppreference.com):

#ifdef NDEBUG
#define assert(condition) ((void)0)
#else
#define assert(condition) /*implementation defined*/
#endif

Hence, when code is compiled in debug mode, the performance measurement includes all these assert expressions as well. Usually, this is considered a distorted result.

Example:

#include <cassert>
#include <iostream>

int main()
{
  int n; std::cin >> n;
  assert(n > 0); // (stupid idea to assert user input)
  std::cout << n;
  return 0;
}

Excerpt from generated code for g++ -std=c++17 -D_DEBUG...:

; 6: int n; std::cin >> n;
        lea     rax, [rbp-4]
        mov     rsi, rax
        mov     edi, OFFSET FLAT:_ZSt3cin
        call    std::basic_istream<char, std::char_traits<char> >::operator>>(int&)
; 7: assert(n > 0); // (stupid idea to assert user input)
        mov     eax, DWORD PTR [rbp-4]
        test    eax, eax
        jg      .L2
        mov     ecx, OFFSET FLAT:main::__PRETTY_FUNCTION__
        mov     edx, 7
        mov     esi, OFFSET FLAT:.LC0
        mov     edi, OFFSET FLAT:.LC1
        call    __assert_fail
.L2:
; 8: std::cout << n;
        mov     eax, DWORD PTR [rbp-4]
        mov     esi, eax
        mov     edi, OFFSET FLAT:_ZSt4cout
        call    std::basic_ostream<char, std::char_traits<char> >::operator<<(int)

Excerpt from generated code for g++ -std=c++17 -DNDEBUG...:

; 6: int n; std::cin >> n;
        lea     rax, [rbp-4]
        mov     rsi, rax
        mov     edi, OFFSET FLAT:_ZSt3cin
        call    std::basic_istream<char, std::char_traits<char> >::operator>>(int&)
; 8: std::cout << n;
        mov     eax, DWORD PTR [rbp-4]
        mov     esi, eax
        mov     edi, OFFSET FLAT:_ZSt4cout
        call    std::basic_ostream<char, std::char_traits<char> >::operator<<(int)

Full Example on godbolt


Trying to Reproduce the Claim of OP...

Disclaimer: I'm not a professional concerning benchmarking. If you want to get another opinion, this might be interesting: CppCon 2015: Chandler Carruth "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!".

Nevertheless, I see some weaknesses in the tests of OP.

  1. Both kinds of test should happen in the same program (to run under conditions as comparable as possible).

  2. Before measuring CPU speed, it is recommended to warm up the CPU.

  3. Modern compilers do dataflow analysis. To be sure that the relevant code isn't optimized away, side effects have to be added carefully. On the other hand, these side effects may weaken the measurement.

  4. Modern compilers are clever enough to compute as much as possible at compile time. To prevent this, I/O should be used to initialize the relevant parameters.

  5. Measurements in general should be repeated to average out measurement errors. (This is something I learnt decades ago in physics lessons as a schoolboy.)

With these things in mind, I modified the sample code of OP a bit:

#include <iostream>
#include <iomanip>
#include <vector>
#include <sys/time.h>

class Number {
  protected:
    long _val;

  public:
    Number(long n): _val(n) { }

    operator long() const;

    Number operator ++(int);
    Number& operator ++();

    bool operator < (long n) const;
};

Number :: operator long() const { return _val; }
Number Number :: operator ++(int) { return Number(_val++); }
Number& Number :: operator ++() { ++_val; return *this; }
bool Number :: operator < (long n) const { return _val < n; }

class NumberInline {

  protected:
    long _val;

  public:
    NumberInline(long n): _val(n) { }

    operator long() const { return _val; }

    NumberInline operator ++(int) { return NumberInline(_val++); }
    NumberInline& operator ++() { ++_val; return *this; }

    bool operator < (long n) const { return _val < n; }
};

long microsec(timeval &t) { return t.tv_sec * 1000000L + t.tv_usec; }

int main()
{
  timeval t1, t2, t3;

  // heat-up of CPU
  std::cout << "Heating up...\n";
  gettimeofday(&t1, nullptr);
  Number n(0), nMax(0);
  do {
    ++n; ++nMax;
    gettimeofday(&t2, nullptr);
  } while (microsec(t2) < microsec(t1) + 3000000L /* 3s */);
  // do experiment
  std::cout << "Starting experiment...\n";
  int nExp; if (!(std::cin >> nExp)) return 1;
  long max; if (!(std::cin >> max)) return 1;
  std::vector<std::pair<long, long>> t;
  for (int i = 0; i < nExp; ++i) {
    Number n = 0; NumberInline nI = 0;
    gettimeofday(&t1, nullptr);
    for (n = 0; n < max; ++n);
    gettimeofday(&t2, nullptr);
    for (nI = 0; nI < max; ++nI);
    gettimeofday(&t3, nullptr);
    std::cout << "n: " << n << ", nI: " << nI << '\n';
    t.push_back(std::make_pair(
      microsec(t2) - microsec(t1),
      microsec(t3) - microsec(t2)));
    std::cout << "t[" << i << "]: { "
      << std::setw(10) << t[i].first << "us, "
      << std::setw(10) << t[i].second << "us }\n";
  }
  double tAvg0 = 0.0, tAvg1 = 0.0;
  for (const std::pair<long, long> &tI : t) {
    tAvg0 += tI.first; tAvg1 += tI.second;
  }
  tAvg0 /= nExp; tAvg1 /= nExp;
  std::cout << "Average times: " << std::fixed
    << tAvg0 << "us, " << tAvg1 << "us\n";
  std::cout << "Ratio: " << tAvg0 / tAvg1 << "\n";
  return 0;
}

I compiled and tested this in cygwin64 on Windows 10 (64 bit):

$ g++ --version
g++ (GCC) 7.3.0

$ echo "10 1000000000" | (g++ -std=c++11 -O0 testInline.cc -o testInline && ./testInline)
Heating up...
Starting experiment...
n: 1000000000, nI: 1000000000
t[0]: {    4811515us,    4579710us }
n: 1000000000, nI: 1000000000
t[1]: {    4703022us,    4649293us }
n: 1000000000, nI: 1000000000
t[2]: {    4725413us,    4724408us }
n: 1000000000, nI: 1000000000
t[3]: {    4777736us,    4744561us }
n: 1000000000, nI: 1000000000
t[4]: {    4807298us,    4831872us }
n: 1000000000, nI: 1000000000
t[5]: {    4853159us,    4616783us }
n: 1000000000, nI: 1000000000
t[6]: {    4818285us,    4769500us }
n: 1000000000, nI: 1000000000
t[7]: {    4753801us,    4693287us }
n: 1000000000, nI: 1000000000
t[8]: {    4781828us,    4439588us }
n: 1000000000, nI: 1000000000
t[9]: {    4125942us,    4090368us }
Average times: 4715799.900000us, 4613937.000000us
Ratio: 1.022077

$ echo "10 1000000000" | (g++ -std=c++11 -O1 testInline.cc -o testInline && ./testInline)
Heating up...
Starting experiment...
n: 1000000000, nI: 1000000000
t[0]: {     395756us,     381372us }
n: 1000000000, nI: 1000000000
t[1]: {     410973us,     395130us }
n: 1000000000, nI: 1000000000
t[2]: {     383708us,     376009us }
n: 1000000000, nI: 1000000000
t[3]: {     399632us,     373718us }
n: 1000000000, nI: 1000000000
t[4]: {     362056us,     398840us }
n: 1000000000, nI: 1000000000
t[5]: {     370812us,     397596us }
n: 1000000000, nI: 1000000000
t[6]: {     381679us,     392219us }
n: 1000000000, nI: 1000000000
t[7]: {     371318us,     396928us }
n: 1000000000, nI: 1000000000
t[8]: {     404398us,     433730us }
n: 1000000000, nI: 1000000000
t[9]: {     370402us,     356458us }
Average times: 385073.400000us, 390200.000000us
Ratio: 0.986862

$ echo "10 1000000000" | (g++ -std=c++11 -O2 testInline.cc -o testInline && ./testInline)
Heating up...
Starting experiment...
n: 1000000000, nI: 1000000000
t[0]: {          1us,          0us }
n: 1000000000, nI: 1000000000
t[1]: {          0us,          0us }
n: 1000000000, nI: 1000000000
t[2]: {          0us,          0us }
n: 1000000000, nI: 1000000000
t[3]: {          0us,          0us }
n: 1000000000, nI: 1000000000
t[4]: {          0us,          0us }
n: 1000000000, nI: 1000000000
t[5]: {          0us,          0us }
n: 1000000000, nI: 1000000000
t[6]: {          0us,          0us }
n: 1000000000, nI: 1000000000
t[7]: {          0us,          0us }
n: 1000000000, nI: 1000000000
t[8]: {          0us,          0us }
n: 1000000000, nI: 1000000000
t[9]: {          0us,          1us }
Average times: 0.100000us, 0.100000us
Ratio: 1.000000

$ 

Concerning the last test (with -O2), I have serious doubts whether the code in question is executed at runtime at all. I loaded the file into godbolt (same g++ version, same options) to get a clue. I had serious trouble interpreting the generated code, but here it is: Sample in Compiler Explorer.

Finally, the ratios of the first two experiments (both close to 1) tell me:

  1. The differences are very probably mere measurement noise.

  2. I cannot reproduce OP's statement that inlined code runs in a significantly different time than non-inlined code.

(That is beside the serious doubts which have already been voiced in the comments and in FalcoGer's answer.)

Scheff's Cat
  • In modern compilers, the `inline` keyword is ignored entirely by the optimizer. The optimizer makes its own decisions about which functions to inline based on heuristics. The `inline` keyword only has meaning to the linker: ignore multiply-defined symbol errors on this function and treat all definitions as identical. – Cody Gray - on strike Apr 02 '19 at 07:03
  • @CodyGray Yeah, that's mentioned as well on the linked cppreference page. I found the cited part (even) more sufficient for my explanation. – Scheff's Cat Apr 02 '19 at 07:06
  • The statement made in the quoted text is less strong than the statement I made. It's been true since the beginning of time that compilers/optimizers were free to ignore `inline`; it's always been a mere hint. Nowadays, they *all* ignore it. It *only* has meaning to the linker. Either way, your answer is drawing a somewhat different conclusion from that fact. When compiling with optimizations disabled (in "debug" mode), the code-gen doesn't do inline function call expansion because it makes debugging difficult. – Cody Gray - on strike Apr 02 '19 at 07:08
  • @CodyGray ...agree partly. They couldn't do non-inlining for debug compiling if compilers weren't allowed to do this in general. ;-) I found it relevant to illustrate why a compiler shouldn't inline everything if it is allowed to do so. (The argument of "inline means generation of more code" doesn't hold in this example.) – Scheff's Cat Apr 02 '19 at 07:11
  • @Scheff, I compared the ASM code generated with and without inline keyword. They were identical, except that ASM code wasn't generated for the unused methods. Hence, I expected the runtime duration also to be exactly same for both the cases. But, I am able to consistently see that it takes more time to complete the with `inline`. – sarva Apr 03 '19 at 07:13
  • @sarva Hard to believe. Can you make a [mcve] that makes your issue reproducible? (Though your question is on hold, you still should be able to edit it. Might be, that's even re-opened if you can provide a convincing, reproducible example.) – Scheff's Cat Apr 03 '19 at 07:14
  • @Scheff, I edited the question to include the complete code. – sarva Apr 03 '19 at 08:17
  • @sarva I fiddled a bit with your example and tried to reproduce what you said. I extended my answer resp. (Spoiler alert: I couldn't reproduce. However, you may try on your own. I believe you did something wrong in your measurement.) – Scheff's Cat Apr 03 '19 at 10:16
  • @sarva Just for fun, I tried again without any `-O` -> same result. ;-) – Scheff's Cat Apr 03 '19 at 10:26
  • @scheff, compiling the code with -O flags will not be useful for this question, because, the code generated with optimisation flags is quite different from the actual code and doesn't have any of the class construct with respect to this example. However, since you mentioned that even without -O flags, both the versions produce same runtime duration, I am surprised. I'll retest with warm up first as you suggested and post my observation. Thanks for the all the help you have been providing so far. – sarva Apr 04 '19 at 05:41
  • @sarva For productive code/deploy, I would exclusively compile with `-O2`. Which knowledge can you gain from comparison of non-optimized code? IMHO, the sense of optimization is to be able to write straight forward, easy to read code without being afraid of sub-optimal performance at runtime... – Scheff's Cat Apr 04 '19 at 05:45
  • @sarva And btw.: See above: _Just for fun, I tried again without any `-O` -> same result._ – Scheff's Cat Apr 04 '19 at 06:01