
I noticed that Google's C++ style guide cautions against inlining functions with loops or switch statements:

Another useful rule of thumb: it's typically not cost effective to inline functions with loops or switch statements (unless, in the common case, the loop or switch statement is never executed).

Other comments on StackOverflow have reiterated this sentiment.

Why are functions with loops or switch statements (or gotos) not suitable for inlining? Does this apply to functions that contain any type of jump? Does it apply to functions with if statements? Also (and this might be somewhat unrelated), why is inlining functions that return a value discouraged?

I am particularly interested in this question because I am working with a segment of performance-sensitive code. I noticed that after inlining a function that contains a series of if statements, performance degrades pretty significantly. I'm using GNU Make 3.81, if that's relevant.

olliezhu
  • I'd recommend leaving inlining decisions to the compiler - and so do compiler-writers, who happily ignore which functions the programmer declared `inline`. – EOF Mar 24 '15 at 01:21
  • 6
    _"I'm using GNU Make 3.81, if that's relevant."_ The more relevant portion might be the C++ compiler implementation used. – πάντα ῥεῖ Mar 24 '15 at 01:21
  • Inlining generally pays off for functions within about 5 lines of code; loops and switch statements usually contain a large amount of logic. If you need to optimize, apply the 80/20 rule: find the root cause and optimize carefully. Consider inlining only after everything else is optimized. – Ron Tang Mar 24 '15 at 01:23
  • I usually don't use inline without checking asm. – user3528438 Mar 24 '15 at 01:25
  • I've inspected a significant sample of our code as instruction dumps in kdb, and from my experience, our compiler is actually inlining functions that are labelled inline, and explicitly calling functions that are not inline. For my case I think I can assume that this process is relatively straightforward. – olliezhu Mar 24 '15 at 01:32
  • @uhwuggawuh: again, which compiler, which architecture? The architecture matters: On archs with stack-based calling conventions (x86 stdcall), the function call has quite high inherent cost. On register-based calling conventions (x86-64, MIPS, x86 fastcall), inlining mostly allows reordering of instructions across the function call, and may eliminate redundant calculations. Some of those advantages can also be gained by constant-propagation. – EOF Mar 24 '15 at 01:38
  • @EOF: Some of the code runs on x86-64 architecture, and some of the code is firmware for a network processor architecture (that has register-based calling convention). So in this case, you're saying that when I inline functions I will not be seeing significant improvements associated with eliminating function call overhead? – olliezhu Mar 24 '15 at 02:07
  • @EOF I agree with your sentiment but not your statements that compilers ignore inline. – Neil Kirk Mar 24 '15 at 02:16

4 Answers


Inlining functions with conditional branches makes it more difficult for the CPU to accurately predict the branch statements, since each instance of the branch is independent.

If there are several branch statements, successful branch prediction saves a lot more cycles than the cost of calling the function.

Similar logic applies to unrolling loops with switch statements.
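As a hedged sketch of why duplication hurts prediction (the function and caller names below are made up for illustration, not from the answer), inlining pastes a copy of every conditional branch into each call site, and each copy must build up its own prediction history:

```cpp
// Illustrative only: names and callers are hypothetical.
inline int sign(int x) {
    if (x < 0) return -1;  // branch A
    if (x > 0) return 1;   // branch B
    return 0;
}

// With inlining, caller1 and caller2 each get their own copies of
// branches A and B, and the predictor tracks each copy separately.
// Out of line, both call sites would share a single, better-trained copy.
int caller1(int x) { return sign(x); }
int caller2(int x) { return sign(x); }
```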

The Google guide referenced doesn't mention anything about functions returning values, so I'm assuming that claim comes from elsewhere; it would need a different question with an explicit citation.

rici
  • Yes, the reference to functions returning values was just something I was seeing around StackOverflow on the topic of inline functions, often in the same breath as the warnings against inlining functions with loops and switch statements. – olliezhu Mar 24 '15 at 01:27
  • 3
    @uhwuggawuh: The reference you gave for that is a 0-point answer by a 1-reputation author... – EOF Mar 24 '15 at 01:29
  • @rici: Are you saying that the reason that inlining is not cost effective in these situations is because the performance bottleneck will be in the relatively cycle-heavy process of looping or branching? – olliezhu Mar 24 '15 at 01:40
  • 1
    @uhwuggawuh: No. I'm saying that not inlining will make it more likely that the CPU will correctly predict branches, making the loops and other branches significantly faster. – rici Mar 24 '15 at 01:42
  • 1
    @rici: unless calling the same function in different places with very different arguments ends up confusing the branch predictor... – EOF Mar 24 '15 at 01:43
  • @EOF: That was not the only case where I saw this being said; there are several other threads on StackOverflow that reiterated this. e.g. https://stackoverflow.com/questions/60830/what-is-wrong-with-using-inline-functions – olliezhu Mar 24 '15 at 01:45
  • @uhwuggawuh: The citation in SO you give (aside from its other weaknesses) doesn't say you shouldn't inline functions which include those four features. It says that the compiler may refuse to inline such functions. That's true only because the compiler may always refuse to inline a function, and it is not obliged to tell you why. – rici Mar 24 '15 at 01:46
  • @EOF: I only said "more likely". – rici Mar 24 '15 at 01:49
  • @rici: Thanks for clarifying. Could you also expand on how this similarly applies to loop unrolling? Are you saying that it is also more difficult for the compiler to unroll loops to improve performance if the function is inlined? – olliezhu Mar 24 '15 at 02:24
  • 1
    @uhwuggawuh: Most branch predictors build up a history of past branches, and use the virtual address of the branch to find its history. If you unroll a loop containing `switch` then you duplicate all the branches in the `switch`, making the CPU take longer to get good branch history and also consuming more of the CPU's (limited number of) "branch prediction slots". – Brendan Mar 24 '15 at 02:56
  • @uhwuggawuh: Also, be aware that (as with all optimisation advice) corner cases where the advice is wrong are possible. A simple example would be inlining a function containing `switch` where most callers use constant arguments (and where the compiler can use constant folding to remove branches once it's been inlined). – Brendan Mar 24 '15 at 03:02
  • @Brendan: but in such a case, a compiler may be more likely to inline, regardless of whether or not requested to. – rici Mar 24 '15 at 03:05

While in your case the performance degradation seems to be caused by branch mispredictions, I don't think that's the reason why the Google style guide advises against inlining functions containing loops or switch statements. There are use cases where the branch predictor can actually benefit from inlining.

A loop is often executed hundreds of times, so the time spent inside the loop is much larger than the call overhead saved by inlining, making the performance benefit negligible (see Amdahl's law). OTOH, inlining increases code size, which has negative effects on the instruction cache.

In the case of switch statements, I can only guess. The rationale might be that jump tables can be rather large, wasting much more memory in the code segment than is obvious.

I think the keyword here is cost effective. Functions that cost a lot of cycles or memory are typically not worth inlining.
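To put rough numbers on the cost-effectiveness argument (the cycle counts in the comments below are illustrative assumptions, not measurements):

```cpp
#include <cstddef>

// Suppose a call/return costs ~5 cycles and each loop iteration ~3 cycles.
// For n = 1000, the loop costs ~3000 cycles per call, so inlining saves
// at most 5/3005 ≈ 0.17% of the runtime (Amdahl's law), while still
// paying the code-size cost at every call site.
long sumArray(const long* a, std::size_t n) {
    long s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s += a[i];  // the loop body dominates, not the call overhead
    return s;
}
```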

nwellnhof

The purpose of a coding style guide is to remind you that, if you are reading it, you are unlikely to have added an optimisation to a real compiler, even less likely to have added a useful one (measured by other people on realistic programs over a range of CPUs), and therefore quite unlikely to be able to out-guess the people who did. At the very least, do not mislead them, for example by putting the volatile keyword in front of all your variables.

Inlining decisions in a compiler have very little to do with making a simple branch predictor happy (or less confused).

First off, the target CPU may not even have branch prediction.

Second, a concrete example:

Imagine a compiler which has no optimisation (turned on) other than inlining. Then the only positive effect of inlining a function is that the bookkeeping related to function calls (saving registers, setting up locals, saving the return address, and jumping to and back) is eliminated. The cost is duplicating the code at every single location where the function is called.

In a real compiler dozens of other simple optimisations are done and the hope of inlining decisions is that those optimisations will interact (or cascade) nicely. Here is a very simple example:

int f(int s)
{
 ...;
 switch (s) {
   case 1: ...; break;
   case 2: ...; break;
   case 42: ...; return ...;
 }
 return ...;
}

void g(...)
{
  int x=f(42);
  ...
}

When the compiler decides to inline f, it replaces the RHS of the assignment with the body of f. It substitutes the actual parameter 42 for the formal parameter s and suddenly it finds that the switch is on a constant value... so it drops all the other branches, and hopefully the known value will allow further simplifications (i.e. they cascade).
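To make the cascade concrete, here is a sketch with the elided bodies filled in by hypothetical return values (the values are made up; only the shape matches the example above):

```cpp
// Hypothetical concrete version of the f/g example above.
int f(int s) {
    switch (s) {
        case 1:  return 10;
        case 2:  return 20;
        case 42: return 420;
    }
    return -1;
}

int g() {
    int x = f(42);  // after inlining, the switch is on the constant 42,
                    // so the other cases are dead and x folds to 420
    return x;       // an optimizer reduces g() to just 'return 420;'
}
```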

If you are really lucky, all calls to the function will be inlined and, unless f is visible outside, the original f will completely disappear from your code. So your compiler eliminated all the bookkeeping and made your code smaller at compile time. And made the code more local at runtime.

If you are unlucky, the code size grows, locality at runtime decreases and your code runs slower.

It is trickier to give a nice example when it is beneficial to inline loops because one has to assume other optimisations and the interactions between them.

The point is that it is hellishly difficult to predict what happens to a chunk of code even if you know all the ways the compiler is allowed to change it. I can't remember who said it, but one should not be able to recognise the executable code produced by an optimising compiler.

user1666959
  • I wish there were a standard-defined way to have code specify that a certain algorithm should be used if it could be in-lined with certain values as constants and a different algorithm used in other cases. For example, code to bit-reverse a 32-bit value could be implemented as a combination of masks and shifts, a loop, or a call to an optimized piece of machine-code. If the value to be reversed is a constant, the combination of masks and shifts would compile down to a constant, but using the masks and shifts on a variable would yield a bloated mess which may be slower than a loop. – supercat Jul 25 '15 at 20:45
  • Some compilers might be able to recognize that a loop written in C, given a constant input, would yield a constant output, but many would not. The machine code could be twice as fast as what a typical compiler could produce, but even when the input is constant no compiler would likely figure out that the output was constant as well. – supercat Jul 25 '15 at 20:48
  • I know where you are going with this :). I think, if you are very keen, you can get the 32-bit reverse example to compile to different algorithms (and then to constants) by using the tempo partial evaluator (http://phoenix.inria.fr/software/past-projects/tempo). It is ancient but you will make a lot of old people happy. – user1666959 Jul 26 '15 at 19:38

I think it might be worth extending the example provided by @user1666959. I'm answering to provide cleaner example code. Let's consider this scenario.

#include <cstddef> // for size_t

/// Counts odd numbers in range [0;number]
size_t countOdd(size_t number)
{
    size_t result = 0;
    for (size_t i = 0; i <= number; ++i)
    {
        result += (i % 2);
    }
    return result;
}

int main()
{
    return countOdd(5);
}

If the function is not inlined and has external linkage, the whole loop will execute at run time. Imagine what happens when you inline it.

int main()
{
    size_t result = 0;
    for (size_t i = 0; i <= 5; ++i)
    {
        result += (i % 2);
    }
    return result;
}

Now let's enable the loop unrolling optimization. Here the compiler knows that the loop iterates from 0 to 5, so it can easily be unrolled, removing the loop condition entirely.

int main()
{
    size_t result = 0;
    // iteration 0
    size_t i = 0;
    result += (i % 2);
    // iteration 1
    ++i;
    result += (i % 2);
    // iteration 2
    ++i;
    result += (i % 2);
    // iteration 3
    ++i;
    result += (i % 2);
    // iteration 4
    ++i;
    result += (i % 2);
    // iteration 5
    ++i;
    result += (i % 2);
    return result;
}

No conditions: it is already faster, but that's not all. We know the value of i at each step, so why not substitute it directly?

int main()
{
    size_t result = 0;
    // iteration 0
    result += (0 % 2);
    // iteration 1
    result += (1 % 2);
    // iteration 2
    result += (2 % 2);
    // iteration 3
    result += (3 % 2);
    // iteration 4
    result += (4 % 2);
    // iteration 5
    result += (5 % 2);
    return result;
}

Even simpler. But wait: those operations are constant expressions, so the compiler can evaluate them during compilation.

int main()
{
    size_t result = 0;
    // iteration 0
    result += 0;
    // iteration 1
    result += 1;
    // iteration 2
    result += 0;
    // iteration 3
    result += 1;
    // iteration 4
    result += 0;
    // iteration 5
    result += 1;
    return result;
}

Now the compiler sees that some of those operations have no effect and keeps only those that change the value. After removing unnecessary temporary variables and performing as many calculations as it can at compile time, your code ends up with:

int main()
{
    return 3;
}

Maciej Załucki