Do modern C++ compilers inline functions which are called exactly once?

Question

As in, say my header file is:

class A
{
    void Complicated();
}

And my source file

void A::Complicated()
{
    ...really long function...
}

Can I split the source file into

void DoInitialStuff(pass necessary vars by ref or value)
{
    ...
}
void HandleCaseA(pass necessary vars by ref or value)
{
    ...
}
void HandleCaseB(pass necessary vars by ref or value)
{
    ...
}
void FinishUp(pass necessary vars by ref or value)
{
    ...
}
void A::Complicated()
{
    ...
    DoInitialStuff(...);
    switch ...
        HandleCaseA(...)
        HandleCaseB(...)
    ...
    FinishUp(...)
}

Entirely for readability and without any fear of impact in terms of performance?

Maybe, maybe not. The compiler programmer might be your best bet, depedning upon what compiler you are using. — DumbCoder, Aug 17 '11 at 16:44
I wonder what the point of inlining would be... calls are quite cheap as it is. — , Aug 17 '11 at 16:45
None of this even happens to be in a loop? Exactly how much time do you hope to gain from avoiding the overhead of a couple of function calls? A nanosecond? — UncleBens, Aug 17 '11 at 16:45
Small functions that get called a lot benefit from inlining. A function that's sufficiently larger than the function call overhead will not benefit from inlining, so I wouldn't worry about it. — Gabe, Aug 17 '11 at 16:45
Declare your internal functions as `static` to give them file scope. They may be inlined even if you don't do this. But if they are not `static`, they will have to be exported, which means a non-inlined version will have to be generated even if it's never used. — Conspicuous Compiler, Aug 17 '11 at 16:47
@UncleBens: Yes it could be in a loop. I am just saying in the code the function would only be referenced once. — Cookie, Aug 17 '11 at 16:50
@pst: Small functions called often will benefit. The function call overhead may be comparable to the amount of work done. — phkahler, Aug 17 '11 at 17:32

score 11 · Accepted Answer · edited Aug 17 '11 at 17:41

You should mark the functions static so that the compiler know they are local to that translation unit.

Without static the compiler cannot assume (barring LTO / WPA) that the function is only called once, so is less likely to inline it.

Demonstration using the LLVM Try Out page.

That said, code for readability first, micro-optimizations (and such tweaking is a micro-optimization) should only come after performance measures.

Example:

#include <cstdio>

static void foo(int i) {
  int m = i % 3;
  printf("%d %d", i, m);
}

int main(int argc, char* argv[]) {
  for (int i = 0; i != argc; ++i) {
    foo(i);
  }
}

Produces with static:

; ModuleID = '/tmp/webcompile/_27689_0.bc'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-unknown-linux-gnu"

@.str = private constant [6 x i8] c"%d %d\00"     ; <[6 x i8]*> [#uses=1]

define i32 @main(i32 %argc, i8** nocapture %argv) nounwind {
entry:
  %cmp4 = icmp eq i32 %argc, 0                    ; <i1> [#uses=1]
  br i1 %cmp4, label %for.end, label %for.body

for.body:                                         ; preds = %for.body, %entry
  %0 = phi i32 [ %inc, %for.body ], [ 0, %entry ] ; <i32> [#uses=3]
  %rem.i = srem i32 %0, 3                         ; <i32> [#uses=1]
  %call.i = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([6 x i8]* @.str, i64 0, i64 0), i32 %0, i32 %rem.i) nounwind ; <i32> [#uses=0]
  %inc = add nsw i32 %0, 1                        ; <i32> [#uses=2]
  %exitcond = icmp eq i32 %inc, %argc             ; <i1> [#uses=1]
  br i1 %exitcond, label %for.end, label %for.body

for.end:                                          ; preds = %for.body, %entry
  ret i32 0
}

declare i32 @printf(i8* nocapture, ...) nounwind

Without static:

; ModuleID = '/tmp/webcompile/_27859_0.bc'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-unknown-linux-gnu"

@.str = private constant [6 x i8] c"%d %d\00"     ; <[6 x i8]*> [#uses=1]

define void @foo(int)(i32 %i) nounwind {
entry:
  %rem = srem i32 %i, 3                           ; <i32> [#uses=1]
  %call = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([6 x i8]* @.str, i64 0, i64 0), i32 %i, i32 %rem) ; <i32> [#uses=0]
  ret void
}

declare i32 @printf(i8* nocapture, ...) nounwind

define i32 @main(i32 %argc, i8** nocapture %argv) nounwind {
entry:
  %cmp4 = icmp eq i32 %argc, 0                    ; <i1> [#uses=1]
  br i1 %cmp4, label %for.end, label %for.body

for.body:                                         ; preds = %for.body, %entry
  %0 = phi i32 [ %inc, %for.body ], [ 0, %entry ] ; <i32> [#uses=3]
  %rem.i = srem i32 %0, 3                         ; <i32> [#uses=1]
  %call.i = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([6 x i8]* @.str, i64 0, i64 0), i32 %0, i32 %rem.i) nounwind ; <i32> [#uses=0]
  %inc = add nsw i32 %0, 1                        ; <i32> [#uses=2]
  %exitcond = icmp eq i32 %inc, %argc             ; <i1> [#uses=1]
  br i1 %exitcond, label %for.end, label %for.body

for.end:                                          ; preds = %for.body, %entry
  ret i32 0
}

I think you ought to put them in an unnamed namespace, according to "principle". — GManNickG, Aug 17 '11 at 17:53
Clang inlined it in both the static and nonstatic versions in your example... I don't think there's a significant difference in whether or not it inlines the call. There's a difference in whether or not the function's code will be emitted or not. — Johannes Schaub - litb, Aug 17 '11 at 19:46
@GMan: I favor `static` because I find it more readable for a human being, an unnamed namespace require the human to remember he's in one, and it isn't obvious. `static` does not require this "context". — Matthieu M., Aug 18 '11 at 06:11
@Matthieu I must be missing something. I *am* looking at your demo, and it has inlined the call. There is no call to `foo` in the entirety of `main`. — Johannes Schaub - litb, Aug 18 '11 at 19:16

score 7 · Answer 2 · answered Aug 17 '11 at 16:44

7

Depends on aliasing (pointers to that function) and function length (a large function inlined in a branch could throw the other branch out of cache, thus hurting performance).

Let the compiler worry about that, you worry about your code :)

answered Aug 17 '11 at 16:44

Blindy

65,249
10
91
131

1

Could a compiler also split up longer functions itself into smaller functions? – Cookie Aug 17 '11 at 16:46
The OP is asking about the comparison between writing the code as above, and writing all the code inline manually. So I guess he/she wants an answer that explains whether breaking a function up will ever have a downside. – Oliver Charlesworth Aug 17 '11 at 16:46
That sounds a bit less likely, what would be the benefit? I think you're more likely to find the compiler copy your blocks over and over one after the other :) – Blindy Aug 17 '11 at 16:47
@Oli, yes, and my reply is basically it doesn't matter. – Blindy Aug 17 '11 at 16:48
1

@Blindy: Your answer says "don't worry about it", which isn't quite the same thing! There are clearly cases where function-call overhead will significantly impact performance; the OP wants to know whether he'll ever be at risk of incurring this cost if he writes his code for clarity. – Oliver Charlesworth Aug 17 '11 at 16:49
Well, "don't worry about it" is actually more correct. I'm sure it *does* matter for that one or two cycles out of the hundred billion or so a modern CPU can do every second... – Blindy Aug 17 '11 at 16:50
2

@Blindy: If this code forms part of the inner-loop that iterates millions of times a second, then yes, it really *does* matter. – Oliver Charlesworth Aug 17 '11 at 16:52
Doubt the overhead of calling *four* functions instead of one matters compared to the amount of work the OP implies is being done. Seriously, a function call is literally a couple of pushes, that's peanuts! – Blindy Aug 17 '11 at 16:53
@Blindy: I think you're deliberately avoiding the point. Clearly, the OP's snippet is just an illustrative example. If you're writing a tight numerical routine (FFT, SVD, matrix multiply, filter, whatever), then that couple of extra cycles could quickly add up, depending on how many functions you break your monolithic implementation into. – Oliver Charlesworth Aug 17 '11 at 16:55
You're guessing as much as me, only on the opposite side of the scale. Unless we get more detailed usage information, I think my answer is fairly accurate. – Blindy Aug 17 '11 at 16:57
@Blindy: The OP's question is sufficiently well-phrased for an accurate yes/no answer. Either "no, you will never incur a cost" or "yes, sometimes the resulting machine code will be slower". Your answer is somewhere in between "yes" and "I don't know". – Oliver Charlesworth Aug 17 '11 at 17:00
No my answer is "I don't care because the cost will be negligible". Besides if the functions he's splitting his code into are so many that they start hurting performance, the compiler will start inlining them into eachother, thus reaching a balance all on its own. – Blindy Aug 17 '11 at 17:01
With all due respect, I'm going to give this answer -1. "I don't care" is not an answer to the OP's question. – Oliver Charlesworth Aug 17 '11 at 17:11
The large function has a switch statement, so each case could be a small function and there are just a lot of them with only 1 being executed. Or if many small functions are called, the overhead would be multiplied. Either way failure to inline could be bad for performance. – phkahler Aug 17 '11 at 17:42

Mark Ransom · Answer 3 · 2011-08-17T16:56:48.607

7

A complicated function is likely to have its speed dominated by the operations within the function; the overhead of a function call won't be noticeable even if it isn't inlined.

You don't have much control over the inlining of a function, the best way to know is to try it and find out.

A compiler's optimizer might be more effective with shorter pieces of code, so you might find it getting faster even if it's not inlined.

edited Aug 17 '11 at 16:56

answered Aug 17 '11 at 16:47

Mark Ransom

299,747
42
398
622

And what about a simple function? – Oliver Charlesworth Aug 17 '11 at 16:52
@Oli, simple functions are more likely to be automatically inlined anyway. – Mark Ransom Aug 17 '11 at 16:55
Don't have much control? That's why there is an "inline" keyword... Simple functions can not be inlined unless they are declared static. Or more specifically a non-inlined version must be available to outside callers. Also, if a large function is broken into 15 small ones (like OP wants to do), that will be 15 times the call overhead. – phkahler Aug 17 '11 at 17:48
2

@phkahler: The inline keyword doesn't force the compiler to inline the function. – Bill Aug 17 '11 at 18:01
@Bill: Right, it's just a suggestion which most compilers listen too (unlike the old register keyword). The fact that it exists and is used refutes the posters second sentence that says you don't have much control. – phkahler Aug 22 '11 at 13:49

score 0 · Answer 4 · answered Aug 17 '11 at 17:05

0

If you split up your code into logical groupings the compiler will do what it deems best: If it's short and easy, the compiler should inline it and the result is the same. If however the code is complicated, making an extra function call might actually be faster than doing all the work inlined, so you leave the compiler the option to do that too. On top of all that, the logically split code can be far easier for a maintainer to grok and avoid future bugs.

answered Aug 17 '11 at 17:05

Mark B

95,107
10
109
188

If the function is called only once, I am unsure that we would see a speed up by not inlining the code... *unless* you actually want to skip the function call and having the function inlined therefore trash your instruction cache. Is it current ? – Matthieu M. Aug 17 '11 at 17:09
My thought was that by breaking the code down into logical function blocks the compiler may be able to utilize registers more intelligently, among other possibilities I can't even imagine. – Mark B Aug 17 '11 at 18:04

score 0 · Answer 5 · answered Aug 17 '11 at 20:46

I suggest you create a helper class to break your complicated function into method calls, much like you were proposing, but without the long, boring and unreadable task of passing arguments to each and every one of these smaller functions. Pass these arguments only once by making them member variables of the helper class.

Don't focus on optimization at this point, make sure your code is readable and you'll be fine 99% of the time.

Do modern C++ compilers inline functions which are called exactly once?

5 Answers5

Linked