Avoiding gcc function prologue overhead?

Question

I've lately encountered a lot of functions where gcc generates really bad code on x86. They all fit a pattern of:

if (some_condition) {
    /* do something really simple and return */
} else {
    /* something complex that needs lots of registers */
}

Think of simple case as something so small that half or more of the work is spent pushing and popping registers that won't be modified at all. If I were writing the asm by hand, I would save and restore the saved-across-calls registers inside the complex case, and avoid touching the stack pointer at all in the simple case.

Is there any way to get gcc to be a little bit smarter and do this itself? Preferably with command line options rather than ugly hacks in the source...

Edit: To make it concrete, here's something very close to some of the functions I'm dealing with:

if (buf->pos < buf->end) {
    return *buf->pos++;
} else {
    /* fill buffer */
}

and another one:

if (!initialized) {
    /* complex initialization procedure */
}
return &initialized_object;

and another:

if (mutex->type == SIMPLE) {
    return atomic_swap(&mutex->lock, 1);
} else {
    /* deal with ownership, etc. */
}

Edit 2: I should have mentioned to begin with: these functions cannot be inlined. They have external linkage and they're library code. Allowing them to be inlined in the application would result in all kinds of problems.

Either way, gcc puts the function prologue/epilogue (saving registers, adjusting stack alignment, etc.) outside of both cases, so the cost is incurred on both. — R.. GitHub STOP HELPING ICE, Mar 29 '11 at 19:23
Your samples lack any of the complicating bits. Are you suggesting that the compiler messes it up even with empty else blocks? — sehe, Mar 29 '11 at 19:35
OK, add `printf("hello, world\n");` to the empty blocks... Seriously it doesn't matter much what's there. If it makes one or more function calls, you'll incur stack alignment prologue, and if it uses a non-trivial amount of registers, you'll incur saving/restoring one or more of ebx/esi/edi/ebp. — R.. GitHub STOP HELPING ICE, Mar 29 '11 at 19:46
Also note: I've tried `__builtin_expect` and it makes no difference. — R.. GitHub STOP HELPING ICE, Mar 29 '11 at 19:47
What happens if you move the complex case into a separate static function, called from the simple function? — caf, Mar 30 '11 at 00:57
@caf: I've tried that, and with the default settings, gcc just inlines the function back in. If I make the function external or use `-fno-inline-functions-called-once` then the code gets better, but since modern gcc insists on keeping the stack 16-byte-aligned for function calls, it puts stack alignment prologue in place of the register saving prologue, then undoes the prologue immediately for the tail-call optimization to the complex function. :-P — R.. GitHub STOP HELPING ICE, Mar 30 '11 at 01:53

score 2 · Answer 1 · answered Sep 08 '12 at 10:59

I would do it like this:

static void complex_function() {}

void foo()
{
    if(simple_case) {
        // do whatever
        return;
    } else {
        complex_function();
    }
}

The compiler my insist on inlining complex_function(), in which case you can use the noinline attribute on it.

score 2 · Answer 2 · edited May 23 '17 at 11:52

2

Update

To explicitely suppress inlining for a single function in gcc, use:

void foo() __attribute__ ((noinline))
{
  ...
}

See also How can I tell gcc not to inline a function?

Functions like this will regularly be inlined automatically unless compiled -O0 (disable optimization).

In C++ you can hint the compiler using the inline keyword

If the compiler won't take your hint you are probably using too many registers/branches inside the function. The situation is almost certainly resolved by extracting the 'complicated' block into it's own function.

Update i noticed you added the fact that they are extern symbols. (Please update the question with that crucial info). Well, in a sense, with external functions, all bets are off. I cannot really believe that gcc will by definition inline all of a complex function into a tiny caller simply because it is only called from there. Perhaps you can give some sample code that demonstrates the behaviour and we can find the proper optimization flags to remedy that?

Also, is this C or C++? In C++ I know it is common place to include the trivial decision functions inline (mostly as members defined in the class declaration). This won't give a linkage conflict like with simple (extern) C functions.

Also you can have template functions defined that will inline perfectly in all compilation modules without resulting in link conflicts.

I hope you are using C++ because it will give you a ton of options here.

edited May 23 '17 at 11:52

Community

1
1

answered Mar 29 '11 at 19:16

sehe

374,641
47
450
633

1

No they won't, because of the complex part. – Blindy Mar 29 '11 at 19:18
And because they're external. – R.. GitHub STOP HELPING ICE Mar 29 '11 at 19:21
Also note that even if I move the 'complicated' branch into its own function, (1) gcc will move it right back because gcc inlines functions that are only called from one place, and (2) even if gcc didn't move it back, setting up for a function call is one of the causes of the ugly prologue code that's making the trivial case slow, and gcc won't put the setup just in one of the branches. – R.. GitHub STOP HELPING ICE Mar 29 '11 at 19:22
1

@R.. you're not really encouraging us to help you and think along. Please provide sample code so we can do actual analysis instead of just doing a 'I think'/'I suppose'/'Nah it would just' show. I'm starting to think you are really venting an opinion instead of asking a question – sehe Mar 29 '11 at 19:26
PS. I noticed only now that you mention the function being external. That is a pain. Updating my answer – sehe Mar 29 '11 at 19:28
About inlining functions only called once, see `-finline-functions-called-once` at http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html. I could experiment with turning this off (it's on by default) and making the complex case a separate function. – R.. GitHub STOP HELPING ICE Mar 29 '11 at 19:36
I tried this, and it avoids saving all four of ebx/esi/edi/ebp in the prologue, but it still aligns the stack before the conditional. `-mpreferred-stack-boundary=2` fixes that but it's an ugly hack. What's really stupid...gcc optimizes the function call to the complex case into a tail call (just a jmp) so it has to undo the stack adjustment before the jmp! – R.. GitHub STOP HELPING ICE Mar 29 '11 at 19:40
http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html has info on an noinline attribute that could be used as well – sehe Mar 29 '11 at 19:45
Is there a way to specify per-function (with attibutes) that it doesn't care if the stack is misaligned? If so, it would be possible to make a single attribute macro which would inhibit the gcc stupidity, and hide it away from the actual code... the only ugliness in the code would be splitting the complex case off into a separate function but that's not so bad. – R.. GitHub STOP HELPING ICE Mar 29 '11 at 19:52
It looks to me like `attribute((optimize("XXX")))` only supports `-O` levels and `-f` options, not `-m` options... I suppose I could make the misaligned call myself with asm, but that's getting into really horrible hacks... – R.. GitHub STOP HELPING ICE Mar 29 '11 at 20:09

Jens Gustedt · Accepted Answer · 2011-03-30T13:17:48.033

1

Perhaps upgrade your version of gcc? 4.6 has just been released. As far as I understand, it has the possibility of "partial inline". That is, an easily integratable outer part of a function is inlined and the expensive part is transformed into a call. But I have to admit that I didn't try it myself, yet.

Edit: The statement I was referring to from the ChangeLog:

Partial inlining is now supported and enabled by default at -O2 and greater. The feature can be controlled via -fpartial-inlining.

Partial inlining splits functions with short hot path to return. This allows more aggressive inlining of the hot path leading to better performance and often to code size reductions (because cold parts of functions are not duplicated).

...

Inlining when optimizing for size (either in cold regions of a program or when compiling with -Os) was improved to better handle C++ programs with larger abstraction penalty, leading to smaller and faster code.

edited Mar 30 '11 at 13:17

answered Mar 29 '11 at 22:06

Jens Gustedt

76,821
6
102
177

@Jens: Thanks for the idea, but inlining is not an option. This is an extern function in a library, and having half of it inlined in the caller would create a nasty implementation/version dependency in the caller. – R.. GitHub STOP HELPING ICE Mar 29 '11 at 22:56
@R.. In any case, `inline` or not it seems that the newer versions of gcc can thus split of a part of a function that is costly. Maybe by looking into the recent optimization improvements of gcc you'd find something useful. – Jens Gustedt Mar 30 '11 at 06:27
Is there a particular option I should look at? – R.. GitHub STOP HELPING ICE Mar 30 '11 at 12:33
That ("hot path" splitting) looks like the general type of optimization I need, but for use when not inlining. I wonder if gcc can apply it to functions that are not candidates for inlining. I might get 4.6 and try playing with it... but my guess is we'll have to wait for 4.7 or 4.8 before this functionality is useful to the case I described... :-( – R.. GitHub STOP HELPING ICE Mar 30 '11 at 13:43
@R.., you could always compile just one function per compilation unit. Tell the compiler to `inline` just that function in that unit but force the generation of the symbol, there. If your header files only contain prototypes and no definitions the compiler simply can't inline for real. My guess would be that gcc would construct the function with the "hot path" feature, thus splitting it in two. – Jens Gustedt Mar 30 '11 at 15:29
so taking that offline if you want – Jens Gustedt Mar 30 '11 at 21:11
There doesn't seem to be any non-hackish solution at present, but your answer is the closest thing to the right *direction* that might eventually lead to a solution as gcc advances, so... accepted! – R.. GitHub STOP HELPING ICE Apr 28 '11 at 04:00
Has there been any improvements on this? In either gcc or llvm? – Cedomir Segulja Sep 23 '14 at 16:19
@CedomirSegulja, I haven't investigate that recently. What I have recently done and which seemed to work is to put the complex case (e.g such a one-time initialization) into a `static` function. Then gcc is able to move the pushing and popping to that one-time function and have the real user interface function nice and clean. – Jens Gustedt Sep 23 '14 at 22:12

score 0 · Answer 4 · answered Mar 29 '11 at 19:27

0

I would probably refactor the code to encourage inlining of the simple case. That said, you can use -finline-limit to make gcc consider inlining larger functions, or -fomit-frame-pointer -fno-exceptions to minimize the stack frame. (Note that the latter may break debugging and cause C++ exceptions to misbehave badly.)

Probably you won't be able to get much from tweaking compiler options, though, and will have to refactor.

answered Mar 29 '11 at 19:27

geekosaur

59,309
11
123
114

Wow, I've had a lot of answers suggesting to do things to get gcc to inline better, when the question states that this an external function and inlining is not possible... – R.. GitHub STOP HELPING ICE Mar 30 '11 at 12:30
The inlining I'm talking about is, in the case of externals, wrapper functions or even macros. Catch the simple case(s) inline and invoke the external for the rest. This does involve testing the condition twice, but only int he case that's expensive anyway so it's down in the noise. This isn't rocket science. – geekosaur Mar 30 '11 at 17:16
Again, this isn't possible in library code, unless you want to write **one of *those* libraries** where every single version is incompatible because all the internals of supposedly-opaque library structures got inlined into the calling application. If gcc weren't messing up the simple cases, my external functions would be 80-90% as good as macros or inline functions, but instead they're only 50-70% as good, because gcc is adding lots of unnecessary overhead. – R.. GitHub STOP HELPING ICE Mar 30 '11 at 17:23
You are busily building a contradiction here. If it's possible to change the function linkage then it's possible to interpose a macro. If you can't interpose a macro then it's too late to change the function linkage. – geekosaur Mar 30 '11 at 17:24
I'm not trying to use macros or inline functions, at all. I'm trying to write a normal `extern`-linkage function in a library that has a short, fast "hot path" (thanks to Jens for the word) with minimal entry/exit overhead and a much larger branch that's executed only once or rarely. – R.. GitHub STOP HELPING ICE Mar 30 '11 at 17:27
And that's the wrong way to do it. Refactoring is the correct way to do it. Otherwise... well, if you're trying to write it, the way to write it is a small `asm` stub that invokes a larger full function if needed — which is the exact same refactoring done the hard way. – geekosaur Mar 30 '11 at 17:31
Have you read the comments on the other posts? Refactoring helps *some* but not much. By default, gcc inlines the factored-out complex path right back in, if it's `static` linkage. This can be inhibited by making it `extern` or by using `-fno-inline-functions-called-once`, and then the generated code is *not horrible*, except that it still has prologue/epilogue for aligning the stack to 16 bytes, since it *might* make a function call. Disabling that with `-mpreferred-stack-boundary=2` does give the correct, optimal code, but I'm looking for something less hackish... – R.. GitHub STOP HELPING ICE Mar 30 '11 at 17:35
I saw that, yes. But if refactoring doesn't help then neither will your magic linkage, or else you're doing it wrong. Using a macro for the hot path should involve *zero* linkage. – geekosaur Mar 30 '11 at 17:39
Macros are not an option. Ideally there would be a gcc option `-fhot-paths` enabled by default in all `-O` levels that would identify any branch paths through a function that involve no function calls and sufficiently low register pressure that no registers need to be saved on the stack, and gcc would move the overhead inside the other branches. But it looks like I'm stuck with either asm wrappers, a pile of hacks gcc option hacks and arbitrary compiler-optimizer-specific factoring, or moderately-slow code... – R.. GitHub STOP HELPING ICE Mar 30 '11 at 17:46
"Not an option"? You're still building a tower of nonsense constraints. If you *insist* on making life hard for yourself, then sure enough: life is hard. – geekosaur Mar 30 '11 at 17:49
Macros are not an option for a shared library. This is not something I made up. It's fundamental. Letting a macro get inlined into the *application* means there are not two bodies of code that must agree on the definition and semantics of the data structures involved: the macro (static) linked into the application, and the code in the (shared) library. Now either you have to lock your library forever into the same implementation of the data structures (which hurts performance more than a bad function prologue) or break compatibility and force everyone to carry around multiple library versions. – R.. GitHub STOP HELPING ICE Mar 30 '11 at 17:54
If you're providing a shared library without a header telling the compiler the right types etc., you're already making a fundamental mistake. Once you provide that header, wrapper macros are free. – geekosaur Mar 30 '11 at 17:58
I think you have a serious misunderstanding here. There's a big difference between a header that provides prototypes for *interfaces* that are designed to be a permanent part of the API (and ABI) for the library, and a header that provides *half-implementations* with dependencies on the internals of library-opaque data structures. Note that in a good library, the `struct` definitions will not even be in the public header. They'll all be incomplete types with definitions in a library-internal header that does not get installed. – R.. GitHub STOP HELPING ICE Mar 30 '11 at 18:21
I think your misunderstanding is that you need to hide every little thing. You are insisting that the standard way of doing this is fundamentally wrong in some pointless way. Again: if you insist on making life hard, sure enough it's hard. Stop building artificial roadblocks. – geekosaur Mar 30 '11 at 18:23
And I think you misunderstand that I'm willing to write crap code that has serious system-maintenance costs for the sake of making the most intense usage cases 30% faster and typical usage 10% faster. I'm not. This kind of behavior may be acceptable if your library is part of a game that needs max speed, but it's not acceptable when you have thousands or tens of thousands of binaries dynamic-linked to it (not to mention other shared libraries). – R.. GitHub STOP HELPING ICE Mar 30 '11 at 18:32
A macro for the fast path is, last I checked, best practice — not crap code. But since you have your own preferred world, go hack assembler. I'm done. – geekosaur Mar 30 '11 at 18:34

score 0 · Answer 5 · answered Mar 30 '11 at 03:35

0

Seeing as these are external calls, it might be possible the gcc is treating them as unsafe and preserving registers for the function call(hard to know without seeing the registers that it preserves, including the ones you say 'aren't used'). Out of curiousity, does this excessive register spilling still occur with all optimizations disabled?

answered Mar 30 '11 at 03:35

Necrolis

25,836
3
63
101

On x86 ABI, ebx, ebp, esi, and edi are required to be saved by the callee if it clobbers them. This is correct. The problem is that gcc unconditionally does all the work of saving and restoring them at function entry and exit, even when the simple case never clobbers them. If it moved the saving and restoring inside the complex branch, the simple branch would be half (or fewer) as many instructions and significantly faster. – R.. GitHub STOP HELPING ICE Mar 30 '11 at 12:32
@R..: You can tweak that behavior a bit but there isn't much fine grained control — and if these functions are `extern` and not under your control, how do you know there are unused registers? – geekosaur Mar 30 '11 at 17:22
@geekosaur: This question has nothing to do with code generation for the *caller*, which of course cannot know what registers the callee uses. This whole topic is about code generation for the *callee*, which absolutely knows what registers it uses. – R.. GitHub STOP HELPING ICE Mar 30 '11 at 18:23
No, it only *absolutely* knows if you write it in assembler or you insist on a particular compiler version and optimization level so that you know with certainty the generated code. In any case, I'm done with this question; you're clearly stuck in a world view in which the only possible answer is "you can't". – geekosaur Mar 30 '11 at 18:27
And you're clearly only interested in giving me "advice" along the lines of "throw other much-more-important considerations out the window and do what I say" rather than trying to answer the question. – R.. GitHub STOP HELPING ICE Mar 30 '11 at 18:33

Avoiding gcc function prologue overhead?

5 Answers5

Linked