43

Now, I know it's because there's no overhead of a function call, but is the overhead of calling a function really that heavy (and worth the bloat of having it inlined)?

From what I can remember, when a function is called, say f(x,y), x and y are pushed onto the stack and execution jumps to the function's code. I know this is a bit of an oversimplification, but am I missing something? A few pushes and a jump to call a function: is there really that much overhead?

Let me know if I'm forgetting something, thanks!

Peter Cordes
kodai
  • 6
    Depending on the particular implementation, there's no absolute guarantee that a stack even exists, so the details of exactly what happens when a function is called can vary somewhat between platforms. As for whether the overhead is significant... It depends on what you are doing. – wrosecrans Oct 25 '10 at 15:49
  • Not always, riders are also there. http://www.parashift.com/c++-faq-lite/inline-functions.html#faq-9.3 – DumbCoder Oct 25 '10 at 15:51
  • 5
    There is no need to worry about this. The compiler analyses the code and decides if a function should be inlined (it ignores any hints you may give). As the compiler knows so much more about how the function is being called, it can make an informed cost/benefit analysis and decide on a case-by-case basis whether it is worth the effort of inlining the code. There is **NO** user interaction required (nor should you try to beat the compiler). – Martin York Oct 25 '10 at 15:57
  • 9
    @Martin York: He is not asking if he should inline a function, he's asking why inlining matters. As such, it is a perfectly valid thing to try to understand. – Chris Pitman Oct 25 '10 at 16:21
  • Sharing a link to a journal article that addresses this question very well: http://accu.org/index.php/journals/449 – Paul Sasik Oct 25 '10 at 17:37
  • @Martin: I've seen this claimed often, I have, however, yet to see a backup with real numbers that _not_ giving the compiler inlining hints produces better code. Don't get me wrong, I don't doubt it will someday, I'm just asking whether we already have this day. – sbi Oct 25 '10 at 19:34
  • @sbi: I am not saying it is impossible to do better than the compiler. What I will say is that unless you are a C++ whizz, an assembler whizz and a compiler whizz all at once, you will not beat the compiler. The set of people that can actually do a better job than the compiler is small, and they know who they are. For these people manual inlining is OK. For the rest of us there are just too many factors that make it impossible to make an informed decision. The only way is to try it both ways and time it, and even then the result will hold only as long as nothing else changes. – Martin York Oct 25 '10 at 19:44
  • @Martin: So you're saying that, if an average C++ programmer does _not_ use `inline`, his code will be faster than if he does? (Note that I claim that, when it comes to performance, the only wizards and gurus are profilers.) Again, is this backed up by any studies, or is this just a guess, however well educated? – sbi Oct 25 '10 at 20:03
  • @sbi: If the keyword actually did something, then I would say: `A developer's code will not be any slower if he does not use it` (not quite the same thing). But the compiler did use this tag in the early days. So the question becomes `why did they turn it off?` I assume they managed to show that the compiler was at least as good as a human in all situations and better in some. Note: once a guru human learns a technique, you just need to update the compiler and it will always use that technique, while the spread of the technique through the human population is very slow (and error prone). – Martin York Oct 25 '10 at 20:21
  • @Martin: But that's what I'm questioning: Did they really turn it off? Does the `inline` keyword really have _no effect_ on inlining anymore (on popular compilers)? – sbi Oct 26 '10 at 05:07
  • I die a little inside whenever a question about speed comes up and it's related to C++. – Finglas Oct 26 '10 at 12:09
  • @Martin: Who turned it off? Inlining in gcc (C99) still works, that is, it produces code with functions inlined at -O3. Sure, deciding if and when to inline is a decision only to be backed up by measurements, but the compiler is most certainly not ignoring us. – Michael Foukarakis Oct 26 '10 at 12:20
  • 2
    @Michael Foukarakis: Nobody turned off inlining as a feature. The compiler no longer uses the `inline` keyword to determine whether to inline the code. Thus the compiler is definitely ignoring you (unless you force it to do otherwise). – Martin York Oct 26 '10 at 17:28

16 Answers

65

Aside from the fact that there's no call (and therefore no associated expenses, like parameter preparation before the call and cleanup after the call), there's another significant advantage of inlining. When a function is inlined, its body can be re-interpreted in the specific context of the caller. This might immediately allow the compiler to further reduce and optimize the code.

For one simple example, this function

void foo(bool b) {
  if (b) {
    // something
  }
  else {
    // something else
  }
}

will require actual branching if called as a non-inlined function

foo(true);
...
foo(false);

However, if the above calls are inlined, the compiler will immediately be able to eliminate the branching. Essentially, in the above case inlining allows the compiler to interpret the function argument as a compile-time constant (if the parameter is a compile-time constant) - something that is generally not possible with non-inlined functions.

However, it is not even remotely limited to that. In general, the optimization opportunities enabled by inlining are significantly more far-reaching. For another example, when the function body is inlined into the specific caller's context, the compiler will in general be able to propagate the known aliasing-related relationships present in the calling code into the inlined function code, thus making it possible to optimize the function's code better.
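
To make the aliasing point concrete, here is a minimal sketch (added for illustration, not part of the original answer; the function and variable names are made up):

void scale_pair(int* a, int* b) {
    // In isolation the compiler must assume a and b may alias, so it has to
    // re-read *a after the write through b.
    *a += 1;
    *b += 1;
    *a *= 2;
}

void caller() {
    int x = 0, y = 0;
    scale_pair(&x, &y);  // once inlined, the compiler sees that x and y are
                         // distinct locals, keeps them in registers and can
                         // fold the whole call down to x = 2, y = 1
}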

Again, the possible examples are numerous, all of them stemming from the basic fact that inlined calls are immersed in the specific caller's context, thus enabling various inter-context optimizations which would not be possible with non-inlined calls. With inlining you basically get many individual versions of your original function, each tailored and optimized individually for each specific caller context. The price of that is, obviously, the potential danger of code bloat, but if used correctly, it can provide noticeable performance benefits.

AnT stands with Russia
  • 7
    Another sweet optimization that inline affords you is instruction-cache efficiency. It's far more likely that inlined code is already in the cache, whereas called code could easily cause a cache miss. – Detmar Oct 25 '10 at 19:10
  • @Detmar: Maybe. And maybe not. From what I know about instruction caches (very little, admittedly), you usually need to measure in order to know, and more often than not the result seems funny and strange. – sbi Oct 25 '10 at 19:24
  • 2
    @Detmar, @sbi: Agreed that this can be mysterious. Using inlines can push hot code out of the sweet L1 instruction cache while using function calls means each function is in cache independently, using less cache space. This is why code compiled on GCC with -Os (reduce size) can be counter-intuitively faster than O2 or O3. – Zan Lynx Oct 25 '10 at 20:45
  • 1
    Yes!!! Function body is inlined and ... suddenly the compiler can eliminate most of the code. This is number one reason why inlining (especially with link-time code generation) is a great thing. – sharptooth Oct 26 '10 at 07:54
  • @andreyT: It's good to mention that one doesn't need to worry too much, unless the function is potentially called X million+ times from within some loop or other. In that case, every cycle saved can add up to seconds of speed gain. If it's just some trivial function, don't inline. – Toad Oct 26 '10 at 08:17
  • Inlining can push hot code out of the L1 cache, but only when the inlined function itself isn't hot. The reason is simple: an inlined version of a function is smaller because there is no call instruction, no argument passing, and no return value passing. – MSalters Oct 26 '10 at 09:01
  • @MSalters: except if the inlined function is called multiple times. (or, I guess, if the called function is inlined multiple times? If the multiply called function is repeatedly inlined? ;)) Then you might get multiple copies of the same code polluting L1 cache. – jalf Oct 26 '10 at 10:59
  • @sbi: It certainly can appear mysterious, especially since it's architecture-specific behaviour. At least on x86 systems, however, Detmar is right. Cache (line) sizes and mysteriously inlining 'cold' code notwithstanding, of course. ;) – Michael Foukarakis Oct 26 '10 at 12:23
  • @AnT: what is "known aliasing-related relationships"? What do you mean by that? – Destructor Jun 20 '15 at 13:53
  • 1
    @meet: I mean that, for example, inside function `void foo(int *a, int *b)` the compiler can't make any assumptions about aliasing: `a` and `b` can point to the same object or to different objects. Either variants offers optimization opportunities, but the compiler cannot take advantage of these opportunities. But at higher level (in caller context) this information might be available. For example, when inlining `int x; foo(&x, &x);` call, the compiler can immediately optimize for `a == b` condition. Likewise for `int x, y; foo(&x, &y);` the compiler can optimize for `a != b`. – AnT stands with Russia Jun 20 '15 at 15:51
  • @AnT: Thanks for explanation giving nice & easy to understand example. – Destructor Jun 21 '15 at 05:24
27

"A few pushes and a jump to call a function, is there really that much overhead?"

It depends on the function.

If the body of the function is just one machine code instruction, the call and return overhead can be many hundred percent; say, 6 times the useful work, i.e. 500% overhead. If your program then consists of nothing but a gazillion calls to that function, with no inlining you've increased the running time by 500%.
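
As a rough sketch of what that means (added for illustration, not part of the original answer; the names are made up):

// next() has a one-instruction body: a single add.
int next(int x) { return x + 1; }

int sum_next(const int* v, int n) {
    int s = 0;
    for (int i = 0; i < n; ++i)
        s += next(v[i]);  // inlined: the loop body is essentially two adds;
                          // not inlined: most of the work is call/return,
                          // argument and return-value handling
    return s;
}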

However, in the other direction inlining can have a detrimental effect, e.g. because code that without inlining would fit in one page of memory doesn't.

So the answer, as always when it comes to optimization: first of all, MEASURE.

oɔɯǝɹ
Cheers and hth. - Alf
  • 11
    Moreover, a really short function might be smaller than the setup and teardown instructions for a function call, and inlining might actually make the code smaller. Measure and profile. – David Thornley Oct 25 '10 at 16:57
12

There is no call and no stack activity, which certainly saves a few CPU cycles. In modern CPUs, code locality also matters: performing a call can flush the instruction pipeline and force the CPU to wait for memory to be fetched. This matters a lot in tight loops, since primary memory is quite a lot slower than modern CPUs.

However, don't worry about inlining if your code is only being called a few times in your application. Worry, a lot, if it's being called millions of times while the user waits for answers!

Pontus Gagge
11

The classic candidate for inlining is an accessor, like std::vector<T>::size().

With inlining enabled, this is just a fetch of a variable from memory, likely a single instruction on most architectures. The "few pushes and a jump" (plus the return) is easily several times as much.

Add to that the fact that the more code is visible at once to an optimizer, the better it can do its work. With lots of inlining, it sees lots of code at once. That means it might be able to keep the value in a CPU register and completely spare the costly trip to memory. Now we might be talking about a difference of several orders of magnitude.
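
A minimal sketch of that effect (added for illustration, not part of the original answer):

#include <cstddef>
#include <vector>

// With size() inlined, the optimizer can see that nothing in the loop changes
// the vector, so it can hoist the size (and the data pointer) into registers;
// with an opaque call it would have to call size() on every iteration.
long sum(const std::vector<int>& v) {
    long s = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        s += v[i];
    return s;
}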

And then there's template meta-programming. Sometimes this results in calling many small functions recursively, just to fetch a single value at the end of the recursion. (Think of fetching the value of the first entry of a specific type in a tuple with dozens of objects.) With inlining enabled, the optimizer can directly access that value (which, remember, might be in a register), collapsing dozens of function calls into accessing a single value in a CPU register. This can turn a terrible performance hog into a nice and speedy program.
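
A hedged sketch of that kind of recursion (added for illustration in modern C++, not part of the original answer; the helper name is made up):

#include <cstddef>
#include <tuple>
#include <type_traits>

// Recursively walk the tuple until an element of type Needle is found. Each
// step is a separate small function call; with inlining the whole chain
// typically collapses into a single member access.
template <typename Needle, std::size_t I = 0, typename Tuple>
const Needle& first_of(const Tuple& t) {
    if constexpr (std::is_same_v<std::tuple_element_t<I, Tuple>, Needle>)
        return std::get<I>(t);
    else
        return first_of<Needle, I + 1>(t);
}

int main() {
    std::tuple<char, double, float, int> t{'a', 2.5, 1.0f, 42};
    return first_of<int>(t);  // usually compiles down to loading 42 directly
}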


Hiding state as private data in objects (encapsulation) has its costs. Inlining was part of C++ from the very beginning in order to minimize these costs of abstraction. Back then, compilers were significantly worse at detecting good candidates for inlining (and rejecting bad ones) than they are today, so manual inlining resulted in considerable speed gains.

Nowadays compilers are reputed to be much more clever than we are about inlining. Compilers can inline functions automatically, or decline to inline functions users marked as inline even though they could. Some say that inlining should be left to the compiler completely and we shouldn't even bother marking functions as inline. However, I have yet to see a comprehensive study showing whether manually doing so is still worth it or not. So for the time being, I'll keep doing it myself, and let the compiler override that if it thinks it can do better.

sbi
5

let

int sum(const int& a, const int& b)
{
    return a + b;
}

int b = 1, c = 2;   // declared here so the snippet is self-contained
int a = sum(b, c);

is equal to

int a = b + c;

No jump - no overhead

kilotaras
  • Better yet: "int a=sum(4,5);" can become "int a=9;". Also, reading and writing variables through references is generally slower than reading and writing them directly; in many cases an in-lined function can be resolved to use faster direct-variable access (note that in your scenario, if not in-line, it would be better to pass variables by value rather than by reference, but if the function did something like "a+=b;" the reference would be necessary). – supercat Oct 25 '10 at 15:43
  • The statement `reading and writing variables through references is generally slower than reading and writing them directly` is way too general to be true. I also find it highly unlikely in most normal situations (see how easy it is to over-generalize). – Martin York Oct 25 '10 at 16:03
  • And how often do we write functions like `sum()`? I think accessors are a much more relevant example for what inlining does. – sbi Oct 25 '10 at 18:45
5

Consider a simple function like:

int SimpleFunc (const int X, const int Y)
{
    return (X + 3 * Y); 
}    

int main(int argc, char* argv[])
{
    int Test = SimpleFunc(11, 12);
    return 0;
}

This is converted to the following code (MSVC++ v6, debug):

10:   int SimpleFunc (const int X, const int Y)
11:   {
00401020   push        ebp
00401021   mov         ebp,esp
00401023   sub         esp,40h
00401026   push        ebx
00401027   push        esi
00401028   push        edi
00401029   lea         edi,[ebp-40h]
0040102C   mov         ecx,10h
00401031   mov         eax,0CCCCCCCCh
00401036   rep stos    dword ptr [edi]

12:       return (X + 3 * Y);
00401038   mov         eax,dword ptr [ebp+0Ch]
0040103B   imul        eax,eax,3
0040103E   mov         ecx,dword ptr [ebp+8]
00401041   add         eax,ecx

13:   }
00401043   pop         edi
00401044   pop         esi
00401045   pop         ebx
00401046   mov         esp,ebp
00401048   pop         ebp
00401049   ret

You can see that there are just 4 instructions for the function body, but 16 instructions of pure function overhead, not including another 3 for calling the function itself. If all instructions took the same time (they don't), then 80% of this code would be function overhead.

For a trivial function like this there is a good chance that the function overhead code will take just as long to run as the main function body itself. When you have trivial functions that are called in a deep loop body millions/billions of times then the function call overhead begins to become large.

As always, the key is profiling/measuring to determine whether or not inlining a specific function yields any net performance gains. For more "complex" functions that are not called "often" the gain from inlining may be immeasurably small.

uesp
  • 6
    This is a debug build, there is memory guarding going on and an oversized stack frame to allow for edit-and-continue. You mustn't use debug code to analyse optimisations! – Skizz Oct 25 '10 at 16:16
4

There are multiple reasons for inlining to be faster, only one of which is obvious:

  • No jump instructions.
  • better locality, resulting in better cache utilization.
  • more chances for the compiler's optimizer to make optimizations, leaving values in registers for example.

The cache utilization can also work against you - if inlining makes the code larger, there's more possibility of cache misses. That's a much less likely case though.

Mark Ransom
3

A typical example of where it makes a big difference is std::sort, which performs O(N log N) calls to its comparison function.

Try creating a large vector and calling std::sort, first with an inlined comparison function and then with a non-inlined one, and measure the performance.

This, by the way, is where sort in C++ is faster than qsort in C, which requires a function pointer.
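
A hedged sketch of that comparison (added for illustration, not part of the original answer; the actual speed difference depends on the compiler, flags and data):

#include <algorithm>
#include <cstdlib>
#include <vector>

// qsort can only take the comparison through a function pointer, so every
// comparison is an indirect call that is hard for the compiler to inline.
int cmp_int(const void* a, const void* b) {
    int x = *static_cast<const int*>(a);
    int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);   // avoids overflow of x - y
}

void sort_both(std::vector<int>& v, std::vector<int>& w) {
    // std::sort: the comparator is a lambda whose call operator the compiler
    // can trivially inline into the sorting loop.
    std::sort(v.begin(), v.end(), [](int a, int b) { return a < b; });

    // qsort: the comparison goes through a function pointer on every call.
    std::qsort(w.data(), w.size(), sizeof(int), cmp_int);
}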

CashCow
2

One other potential side effect of the jump is that you might trigger a page fault, either to load the code into memory the first time, or if it's used infrequently enough to get paged out of memory later.

Jimmy
2

(and worth the bloat of having it inlined)

It is not always the case that in-lining results in larger code. For example a simple data access function such as:

int data;   // declared here so the snippet is self-contained

int getData()
{
   return data;
}

will result in significantly more instruction cycles as a function call than as an in-line expansion, and such functions are best suited to in-lining.

If the function body contains a significant amount of code, the function call overhead will indeed be insignificant, and if it is called from a number of locations it may result in code bloat - although your compiler is as likely to simply ignore the inline directive in such cases.

You should also consider the frequency of calling; even for a large-ish code body, if the function is called frequently from one location, the saving may in some cases be worthwhile. It comes down to the ratio of call-overhead to code body size, and the frequency of use.

Of course you could just leave it up to your compiler to decide. I only ever explicitly in-line functions that comprise a single statement not involving a further function call, and that is more for speed of development of class methods than for performance.

Clifford
2

Andrey's answer already gives you a very comprehensive explanation. But just to add one point that he missed, inlining can also be extremely valuable on very short functions.

If a function body consists of just a few instructions, then the prologue/epilogue code (the push/pop/call instructions, basically) might actually be more expensive than the function body itself. If you call such a function often (say, from a tight loop), then unless the function is inlined, you can end up spending the majority of your CPU time on the function call, rather than the actual contents of the function.

What matters isn't really the cost of a function call in absolute terms (where it might take just 5 clock cycles or something like that), but how long it takes relative to how often the function is called. If the function is so short that it can be called every 10 clock cycles, then spending 5 cycles for every call on "unnecessary" push/pop instructions is pretty bad.

jalf
  • Yes, and also when a function only contains a few instructions, which might be reduced further still when the function body is optimized in the caller's context. So instead of prologue/epilogue + say ten instructions, you may end up with no prologue, no epilogue and maybe four instructions, which gives a huge performance gain. – sharptooth Oct 26 '10 at 11:22
1

Because there's no call. The function code is just copied into the call site.

Raphael
  • @kodai, the call stack doesn't have instructions. At least, not in any normal code. – kanaka Oct 25 '10 at 15:29
  • Instructions don't go on the stack, @Kodai, and since there's no function call, there's certainly nothing extra on the call stack. – Rob Kennedy Oct 25 '10 at 15:30
  • Your question doesn't make sense. The "call stack" does not get bloated by inline instructions. The system doesn't need to keep track of how many instructions there are in the current function. Inlining just takes the code that was in the called function and splices it into the calling function, so that instead of adding a frame to the call stack, the code just executes. – Yuliy Oct 25 '10 at 15:30
  • Inlining avoids bloating the call stack! But it could make the code too big to fit in the cache, so yes, inlining can also reduce performance. Along with the increase in code size, that's what stops compilers from inlining everything. –  Oct 25 '10 at 15:32
  • 1
    -1 he didn't ask what inline is, he asked if it really was significantly faster. – o0'. Oct 25 '10 at 18:28
  • Which is also answered here: the code is faster because there's no function call – Raphael Oct 25 '10 at 19:13
1

Inlining a function is a suggestion to the compiler to replace the function call with the function's definition. If it is replaced, then there are no function-call stack operations (push, pop). But it is not always guaranteed. :)


Koteswara sarma
1

Optimizing compilers apply a set of heuristics to determine whether or not inlining will be beneficial.

Sometimes the gain from the lack of a function call will outweigh the potential cost of the extra code, sometimes not.

JoeG
0

Inlining makes a big difference when a function is called many times.

MicSim
  • Could you explain please? Thanks! – kodai Oct 25 '10 at 15:28
  • See the other answers which also elaborate on this point. Additionally, check the following link for some points about inlining and performance: http://www.parashift.com/c++-faq-lite/inline-functions.html#faq-9.3 – MicSim Oct 25 '10 at 15:49
-1

Because no jump is performed.

Johann Gerell
  • This is not strictly true on modern Intel CPUs. The prefetch unit will follow unconditional, direct jumps, so there is no direct overhead. The OS may introduce an overhead if the target address causes a page fault. EDIT: what I meant was, the presence or absence of a jmp instruction makes no difference. – Skizz Oct 25 '10 at 16:26
  • There is still a jump, even if the CPU is quite good at handling it. Its performance impact is just much more subtle than most people realize, but it's not exactly "free" either. – jalf Oct 26 '10 at 11:52