
I'm currently experiencing a weird effect with gcc (tested version: 4.8.4).

I've got performance-oriented code which runs pretty fast. Its speed depends in large part on inlining many small functions.

Since inlining across multiple .c files is difficult (-flto is not yet widely available), I've kept a lot of small functions (typically 1 to 5 lines of code each) in a single common C file, in which I'm developing a codec and its associated decoder. It's "relatively" large by my standards (about ~2000 lines, although a lot of them are just comments and blank lines), but breaking it into smaller parts would open new problems, so I would prefer to avoid that, if possible.

The encoder and decoder are related, since they are inverse operations. But from a programming perspective, they are completely separate, sharing nothing except a few typedefs and some very low-level functions (such as reading from an unaligned memory position).
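
To illustrate the kind of shared code I mean, here is a minimal sketch. The names and bodies are placeholders, not my actual code; the memcpy idiom is just the common portable way to do an unaligned access, which gcc typically compiles to a single load/store on x86.

#include <stdint.h>
#include <string.h>

/* Placeholder typedefs shared by the encoder and the decoder. */
typedef uint16_t u16;
typedef uint32_t u32;

/* Read a 32-bit value from a possibly unaligned address. */
static u32 read32(const void *p)
{
    u32 v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Matching unaligned 32-bit store. */
static void write32(void *p, u32 v)
{
    memcpy(p, &v, sizeof v);
}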

The strange effect is this one:

I recently added a new function, fnew, to the encoder side. It's a new "entry point". It's neither used nor called from anywhere within the .c file.

The simple fact that it exists makes the performance of the decoder function fdec drop substantially, by more than 20%, which is way too much to be ignored.

Now, keep in mind that the encoding and decoding operations are completely separate, and share almost nothing, save for some minor typedefs (u32, u16 and such) and associated operations (read/write).

When the new encoding function fnew is defined as static, performance of the decoder fdec goes back to normal. Since fnew isn't called from the .c file, I guess the result is the same as if it were not there (dead code elimination).
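
A toy illustration of this point (invented names, nothing to do with my real code): with internal linkage, an uncalled function can be dropped entirely, so it occupies no space and shifts no other code.

/* At -O2/-O3, gcc emits no code at all for an unreferenced static function
   (it may just warn with -Wunused-function). */
static int unused_helper(int x) { return x + 1; }

/* With external linkage, the function could be called from another .c file,
   so its code must be emitted even if nothing in this file uses it. */
int exported_entry(int x) { return x * 3; }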

If static fnew is now called from the encoder side, performance of fdec remains strong.

But as soon as fnew is modified, fdec performance just drops substantially.

Presuming the fnew modifications crossed some threshold, I increased the following gcc parameter: --param max-inline-insns-auto=60 (by default, its value is supposed to be 40). And it worked: performance of fdec is now back to normal.

And I guess this game will continue forever with each little modification of fnew or anything else similar, requiring further tweaks.

This is just plain weird. There is no logical reason for some little modification in function fnew to have a knock-on effect on the completely unrelated function fdec, whose only relation to it is being in the same file.

The only tentative explanation I could come up with so far is that maybe the simple presence of fnew is enough to cross some kind of global file threshold which impacts fdec. fnew can be made "not present" when it's: 1. not there, 2. static but not called from anywhere, or 3. static and small enough to be inlined. But that's just hiding the problem. Does it mean I can't add any new function?

Really, I couldn't find any satisfying explanation anywhere on the net.

I was curious to know if someone has already experienced an equivalent side effect and found a solution to it.

[Edit]

Let's go for an even crazier test. Now I'm adding another completely useless function, just to play with. Its content is strictly a copy-paste of fnew, but the name of the function is obviously different, so let's call it wtf.

When wtf exists, it doesn't matter whether fnew is static or not, nor what the value of max-inline-insns-auto is: performance of fdec is back to normal. Even though wtf is neither used nor called from anywhere... :'(

[Edit 2] There is no inline keyword. All functions are either normal or static. The inlining decision is solely within the compiler's realm, which has worked fine so far.

[Edit 3] As suggested by Peter Cordes, the issue is not related to inlining, but to instruction alignment. On newer Intel CPUs (Sandy Bridge and later), hot loops benefit from being aligned on 32-byte boundaries. The problem is that, by default, gcc aligns them on 16-byte boundaries, which gives only a 50% chance of proper alignment, depending on the length of the preceding code. Hence a difficult-to-understand issue, which "looks random".

Not all loops are sensitive. It only matters for critical loops, and only if their length makes them cross one more 32-byte instruction segment when less ideally aligned.

Cyan
  • Could it be that with a lot of inlining your code just suddenly starts thrashing the cache just to load the instructions? And the `fnew` function gets generated by gcc in a place which causes the decoder to get more cache misses. What hints that to me is that defining it static allows the compiler to make stronger assumptions from where it will be called and co-locate it there. My experience from a large project (several MB binary) has been that I could get large performance improvements from deinlining functions just to shrink cache usage by code. – Art Sep 02 '15 at 12:38
  • Can you make a minimal self-contained example so we can toy around with this, too? This is really interesting. – fuz Sep 02 '15 at 12:39
  • @FUZxxl : unfortunately no. While modifications lead to reproducible results, the relation between modification and effect seems totally random. Which makes it impossible to consciously produce a minimal self-contained example... – Cyan Sep 02 '15 at 12:57
  • @Art : maybe. Note that I'm not forcibly inlining anything, I'm letting the compiler decide, because in most circumstances, functions to inline are very small (1-5 lines), so it seems only common sense. I also tried to look at the assembly generated. The "slow" and "fast" versions of `fdec` look surprisingly similar. There must be some subtle difference somewhere, but the generated assembly is way too large for me to analyze and find such subtle thing. – Cyan Sep 02 '15 at 13:02
  • @Cyan You should consider reporting a compiler bug or at least asking this question on the gcc mailing list. Does the problem persist if you switch to clang? – fuz Sep 02 '15 at 13:03
  • @FUZxxl : yes, that could be something for the gcc team. Performance with clang is bad no matter what, but I know clang is more conservative with auto-inlining, and plan to look into it later into the development process. For the time being, the currently described issue is expected to be gcc-related, strictly. – Cyan Sep 02 '15 at 13:07
  • @Cyan "it seems only common sense". You'd be surprised. I once deinlined a two line lock function that could be compiled down to a handful of instructions, shrunk the program by 100kB and improved overall performance by around 10%. This was over a decade ago, so hopefully gcc has improved somewhat, but in my experience all the settings in compilers are tuned for winning microbenchmarks that usually don't need to care about cache usage by instructions. Not saying that this happens here, but the compiler is often not as smart as we'd like it to be. – Art Sep 02 '15 at 13:21
  • @Art good point, but since there is no "inline" statement in this particular piece of code, I don't see what more could be done to refrain gcc from inlining it. Plus, in most cases, inlining is really wanted and good for performance. – Cyan Sep 02 '15 at 13:29
  • @Cyan: Can you put the code on github or something, if it's not closed source? External links aren't ideal, but they're better than the hand-waving in the question. I suspect there's an alignment / cache-line issue. Changes to the inlining parameter will change the size of functions, and bump other code over by multiples of 16B. This is a similar effect to changing the contents of a function: it gets bigger and changes the alignment of other functions. – Peter Cordes Sep 02 '15 at 13:45
  • Look at performance counters for the "slow" and "fast" versions, and see if any are different. e.g. a lot more cache misses could indicate false sharing of a cache line between threads. Also, profile and see if there's a new hotspot in the "slow" version. – Peter Cordes Sep 02 '15 at 13:47
  • @Peter : I was expecting the compiler to always make sure a function starts at an ideally aligned position, using nops to fill gaps. – Cyan Sep 02 '15 at 14:28
  • Current tests are done on a Broadwell cpu (i7-5600u) – Cyan Sep 02 '15 at 14:50
  • Is there a way to have a look at function alignment? – Cyan Sep 02 '15 at 15:44
  • @PeterCordes : I've been compiling the target source code with option `-falign-functions=64`, which if I understand properly force functions to be aligned on 64-bytes, hence cache lines. Performance of the slow version does not change, remains slow. But surprisingly, the fast version becomes slow with this option. Removing it gives back initial performance. – Cyan Sep 02 '15 at 15:54
  • I'm leaning towards an alignment issue, although a fairly complex one. Are there tools to inspect the alignment of code? Are there docs which explain how it works at the CPU level regarding instruction alignment? – Cyan Sep 02 '15 at 16:55
  • @Cyan: Converted my recent comments to an answer. Continue discussion there. – Peter Cordes Sep 02 '15 at 20:42
  • I might miss it, but what compiler flags do you specify? Is there `-O2` or `-O3`? – Maxim Egorushkin Sep 02 '15 at 21:37
  • @maxim all tests are done at `-O3` setting – Cyan Sep 03 '15 at 00:53

2 Answers

2

Turning my comments into an answer, because this was becoming a long discussion. The discussion showed that the performance problem is sensitive to alignment.

There are links to some perf-tuning info at https://stackoverflow.com/tags/x86/info, including Intel's optimization guide and Agner Fog's excellent material. Some of Agner Fog's assembly optimization advice doesn't fully apply to Sandybridge and later CPUs. If you want the low-level details on a specific CPU, though, the microarch guide is very good.

Without at least an external link to code that I can try myself, I can't do more than handwave. If you don't post the code anywhere, you're going to need to use profiling / CPU performance counter tools like Linux perf or Intel VTune to track this down in a reasonable amount of time.


In chat, the OP found someone else having this issue, but with code posted. This is probably the same issue the OP is seeing, and is one of the major ways code alignment matters for Sandybridge-style uop caches.

There's a 32B boundary in the middle of the loop in the slow version. The instructions that start before the boundary decode to 5 uops. So in the first cycle, the uop cache serves up mov/add/movzbl/mov. In the 2nd cycle, there's only a single mov uop left in the current cache line. Then the 3rd cycle issues the last 2 uops of the loop: add and cmp+ja.

The problematic mov starts at 0x..ff. I guess instructions that span a 32B boundary go into (one of) the uop cacheline(s) for their starting address.

In the fast version, an iteration only takes 2 cycles to issue: The same first cycle, then mov / add / cmp+ja in the 2nd.

If one of the first 4 instructions had been one byte longer (e.g. padded with a useless prefix, or a REX prefix), there would be no problem. There wouldn't be an odd-man-out at the end of the first cacheline, because the mov would start after the 32B boundary and be part of the next uop cache line.

AFAIK, assembling and checking the disassembly output is the only way to use longer versions of the same instructions (see Agner Fog's Optimizing Assembly) to get 32B boundaries at multiples of 4 uops. I'm not aware of a GUI that shows the alignment of assembled code as you're editing. (And obviously, doing this only works for hand-written asm, and is brittle. Changing the code at all will break the hand-alignment.)

This is why Intel's optimization guide recommends aligning critical loops to 32B.

It would be really cool if an assembler had a way to request that preceding instructions be assembled using longer encodings to pad out to a certain length. Maybe a .startencodealign / .endencodealign 32 pair of directives, to apply padding to code between the directives to make it end on a 32B boundary. This could make terrible code if used badly, though.


Changes to the inlining parameter will change the size of functions, and bump other code over by multiples of 16B. This is a similar effect to changing the contents of a function: it gets bigger and changes the alignment of other functions.

You said you were expecting the compiler to always make sure a function starts at an ideally aligned position, using nops to fill gaps.

There's a tradeoff. It would hurt performance to align every function to 64B (the start of a cache line). Code density would go down, with more cache lines needed to hold the instructions. 16B is good, because it's the instruction fetch/decode chunk size on most recent CPUs.
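
If you do want extra alignment for just one hot function, without paying the code-density cost across the whole file, gcc can do it per-function. This is only a sketch (the function name and body are invented stand-ins), and note that it aligns the function's entry point, not the loop inside it, so it only helps if the loop's offset from the entry stays the same:

#include <stddef.h>

/* Hypothetical stand-in for the hot decoder: request a 64B-aligned entry
   point for this one function instead of building with -falign-functions=64. */
__attribute__((aligned(64)))
void hot_decode(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)             /* stand-in for the real hot loop */
        dst[i] = (unsigned char)(src[i] ^ 0x5A);
}

You'd still want to check the disassembly of the final binary to see where the loop itself actually lands.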

Agner Fog has the low-level details for each microarch. He hasn't updated it for Broadwell yet, but the uop cache probably hasn't changed since Sandybridge. I assume there's one fairly small loop that dominates the runtime. I'm not sure exactly what to look for first. Maybe the "slow" version has some branch targets near the end of a 32B block of code (and hence near the end of a uop cacheline), leading to significantly less than 4 uops per clock coming out of the frontend.

Look at performance counters for the "slow" and "fast" versions (e.g. with perf stat ./cmd), and see if any are different. e.g. a lot more cache misses could indicate false sharing of a cache line between threads. Also, profile and see if there's a new hotspot in the "slow" version. (e.g. with perf record ./cmd && perf report on Linux).

How many uops/clock is the "fast" version getting? If it's above 3, frontend bottlenecks (maybe in the uop cache) that are sensitive to alignment could be the issue. Either that or L1 / uop-cache misses if different alignment means your code needs more cache lines than are available.

Anyway, this bears repeating: use a profiler / performance counters to find the new bottleneck that the "slow" version has, but the "fast" version doesn't. Then you can spend time looking at the disassembly of that block of code. (Don't look at gcc's asm output. You need to see the alignment in the disassembly of the final binary.) Look at the 16B and 32B boundaries, since presumably they'll be in different places between the two versions, and we think that's the cause of the problem.

Alignment can also make macro-fusion fail, if a compare/jcc splits a 16B boundary exactly. Although that is unlikely in your case, since your functions are always aligned to some multiple of 16B.

re: automated tools for alignment: no, I'm not aware of anything that can look at a binary and tell you anything useful about alignment. I wish there was an editor to show groups of 4 uops and 32B boundaries alongside your code, and update as you edit.

Intel's IACA can sometimes be useful for analyzing a loop, but IIRC it doesn't know about taken branches, and I think doesn't have a sophisticated model of the frontend, which is obviously the issue if misalignment breaks performance for you.

Peter Cordes
  • May be try `-falign-functions=N` to see if it makes any difference? – Maxim Egorushkin Sep 02 '15 at 21:00
  • @MaximEgorushkin: OP already tried, see comments on the question. `-falign-functions=64` is apparently enough to flip performance from "fast" to "slow". – Peter Cordes Sep 02 '15 at 21:08
  • I could get some good results by using `-falign-loops=32`: then the generated binary seems to be always fast. So my latest suspicion is that the critical alignment condition must apply to the main decoding loop. For reference, `-falign-loops=16` tends to produce the "slow" version. The difficulty is that any change somewhere else in the code can randomly change the result, hence the suspicion of an alignment issue. However, I'm very surprised that the binary size is *exactly the same* at align 16 and 32, which is counter-intuitive: I would expect the 32 version to be larger. – Cyan Sep 03 '15 at 00:57
  • @Cyan: that is funny that the binary ends up the same size. Maybe something else was still being aligned to 32, and `-falign-loops=32` just pushes something over by 16, leading to less padding after. If you're lucky, `-falign-loops=32` will be the trick to robust good performance, even in the face of other changes. 32B code boundaries are always uop cache-line boundaries. Like I keep saying, though, we can't do anything more than hand-wave until you post a link to the code, or find the actual spot in your loop where the "slow" version stalls, but the "fast" version doesn't. – Peter Cordes Sep 03 '15 at 01:12
  • For other speedups, are you using `-O3 -march=native`? For a codec, esp. if you use arithmetic coding that represents a symbol with less than a full byte, you might find the bit-manipulation instructions useful. (BMI/BMI2, introduced with Haswell). I haven't used them myself, but `bextr` looks very useful for unpacking bits from a bitstream. The intrinsic is `uint64_t _bextr_u64 (uint64_t a, unsigned int start, unsigned int len)`. So if you want to make a version of your functions for CPUs with BMI, you might get a speedup on them. And of course vectors if applicable. – Peter Cordes Sep 03 '15 at 01:24
  • While `-falign-loops=32` seems to work, it has a downside: it's a gcc-only command flag (not compatible with clang). As a way to mitigate this issue, I tried to apply this option from within the source code, using `#pragma GCC optimize ("align-loops=32")` as suggested by the GCC docs. But it did not work: the resulting binary remained slow, as if the pragma was ignored. I tried to get the size and alignment of the loop from the generated assembly but failed (too complex a dump). Finally, I tried to `perf` the slow and fast versions, unfortunately `-pg` changes alignments, only generated fast ... – Cyan Sep 03 '15 at 01:30
  • @Cyan: `gcc -pg` adds instructions to instrument the binary, producing data that can be read by `gprof`. `perf` uses hardware performance counters to profile any binary, with counts per-instruction. `-pg` and `perf` are totally separate things. I haven't used `#pragma`s, so IDK what's going wrong there. Maybe it's the alignment of a loop in another function that pushes the start of your hot function to a better location? – Peter Cordes Sep 03 '15 at 01:42
  • Come on... Thanks for reminding me what these tools actually do. I'll send some results from `perf` in chat. – Cyan Sep 03 '15 at 01:54
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/88641/discussion-between-cyan-and-peter-cordes). – Cyan Sep 03 '15 at 02:06
0

In my experience, performance drops like this may be caused by inlining being disabled.

The 'inline' modifier doesn't force a function to be inlined; it only gives the compiler a hint to inline the function. So when the compiler's inlining criteria are no longer satisfied after trivial modifications of the code, a function marked inline is normally compiled as an ordinary, non-inlined function.

And there is something that makes the problem more complex: nested inlining. If you have an inline function, fA, that calls another inline function, fB, like this:

static inline int fB(int x, int y) {
    return x * y;
}

static inline int fA(void) {
    int sum = 0;
    for (int i = 0; i < 0x10000000; ++i) {
        sum += fB(i, i + 1);
    }
    return sum;   /* use the result so the loop can't be discarded outright */
}

int main(void) {
    return fA() & 0xff;
}

In this case, we expect both fA and fB to be inlined. But if the inlining criteria are not met, performance becomes unpredictable: a large performance drop occurs when inlining is disabled for fB, but only a slight one when it is disabled for fA. And, as you know, the compiler's internal decisions are much more complex than this.

The reasons inlining gets disabled include, for example, the size of the function being inlined, the size of the .c file, the number of local variables, and so on.
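
As a small sketch (gcc-specific attributes, invented function names), the difference between the hint and taking the decision away from those heuristics can be made explicit:

/* 'inline' is only a hint: gcc may still emit an out-of-line copy and call it. */
static inline int hinted(int x) { return x * 2 + 1; }          /* may or may not be inlined */

/* always_inline: inlined regardless of the size heuristics (or a hard error). */
__attribute__((always_inline)) static inline int forced(int x) { return x * 2 + 1; }

/* noinline: kept out of line, useful for comparing the two behaviors. */
__attribute__((noinline)) static int kept_outline(int x) { return x * 2 + 1; }

int use_all(int x) { return hinted(x) + forced(x) + kept_outline(x); }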

Actually, I have experienced this kind of performance drop in C#. In my case, a 60% performance drop occurred when one local variable was added to a simple inlined function.

EDIT:

You can investigate what happens by reading the compiled assembly code. I guess there are unexpected real calls to functions marked with 'inline'.

  • The question states that the `inline` keyword is not used. – Cyan Sep 02 '15 at 14:54
  • A function without 'inline' can be inlined by the compiler. So I think your program is optimized with inlining. – Akio Takahashi Sep 02 '15 at 15:11
  • Yes it is. Most small functions are being inlined, except when the compiler detects that it's called very rarely. Using `gcc -fdump-ipa-inline`, one can see the inlining decision of gcc. Beware, this is a large dump, which can be difficult to read. I did it on both the slow and fast versions. From the dump, it seems that inlining decisions are identical. – Cyan Sep 02 '15 at 15:42
  • You can also look at the assembly code generated by the compilation -- I think you compile it with -S. Worst case, you generate the assembly code using the 'fast' version, save the assembly code, then add the extra function, comment out the offending function, and link it with the compiled assembly code. – gbronner Sep 02 '15 at 15:44
  • I did the "generate assembly" exercise. Problem is, the result is a huge file (several tens of thousands lines). It's way too large for me to decode usefully. My attempt to spot a difference between the "fast" and "slow" versions failed, they look mostly identical. Since all labels are different, it's impossible to make a simple "file comparison", so reading the assembly is the only solution I found to compare. Not practical with such quantity of code. – Cyan Sep 02 '15 at 15:58
  • @Cyan The result of -fdump-ipa-inline shows candidates for inlining, not what was actually inlined. Use -fdump-tree-all according to [this](http://stackoverflow.com/questions/9190359/how-to-find-out-which-functions-were-not-inlined), or use -fno-inline to check whether inlining optimization is the cause or not. – Akio Takahashi Sep 02 '15 at 16:11
  • `-fdump-ipa-inline` does give the inlined functions, but not at the beginning, which is only about candidates. Later on in the files, it states the definitive answer for each candidate. Besides, I'm sure that inlining is, on average, massively beneficial. This is confirmed by making the test with `-fno-inline` as suggested : speed is reduced by a factor 3. – Cyan Sep 02 '15 at 16:26
  • It is one of the mechanisms that cause performance drops. If the assembly code is essentially identical, inlining is not related to your problem. I'm afraid I can't guess what would cause a 20% performance drop without changes to the algorithm or code flow. – Akio Takahashi Sep 02 '15 at 17:06
  • Hm, I'd expect the inline "hint" to be always followed if technically possible. I think it basically turns heuristics off. – usr Sep 02 '15 at 20:46