I'm currently seeing a strange effect with gcc (tested version: 4.8.4).
I've got performance-oriented code which runs pretty fast. Its speed depends to a large degree on inlining many small functions.
Since inlining across multiple .c files is difficult (-flto is not yet widely available), I've kept a lot of small functions (typically 1 to 5 lines of code each) in a common C file, in which I'm developing a codec and its associated decoder. It's "relatively" large by my standard (about ~2000 lines, although a lot of them are just comments and blank lines), but breaking it into smaller parts opens new problems, so I would prefer to avoid that if possible.
Encoder and decoder are related, since they are inverse operations. But from a programming perspective, they are completely separate, sharing nothing except a few typedefs and some very low-level functions (such as reading from an unaligned memory position).
The strange effect is this one: I recently added a new function fnew to the encoder side. It's a new "entry point"; it is not used nor called from anywhere within the .c file. The simple fact that it exists makes the performance of the decoder function fdec drop substantially, by more than 20%, which is far too much to ignore.
Now, keep in mind that encoding and decoding operations are completely separate, and share almost nothing, save some minor typedefs (u32, u16 and such) and the associated operations (read/write).
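For illustration, the shared low-level read helpers are roughly along these lines (a simplified sketch; the actual names in my file differ):

```c
#include <stdint.h>
#include <string.h>

typedef uint16_t u16;
typedef uint32_t u32;

/* Read a 16-bit value from a possibly unaligned address.
   memcpy is the portable idiom; gcc compiles it down to a plain load on x86. */
static u16 read16(const void *p) { u16 v; memcpy(&v, p, sizeof v); return v; }

/* Same idea for 32-bit values. */
static u32 read32(const void *p) { u32 v; memcpy(&v, p, sizeof v); return v; }
```

These are exactly the kind of 1-to-5-line functions the codec relies on the compiler to inline.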
When the new encoding function fnew is defined as static, performance of the decoder fdec goes back to normal. Since fnew isn't called from within the .c file, I guess it's the same as if it were not there at all (dead code elimination).
If static fnew is now called from the encoder side, performance of fdec remains strong. But as soon as fnew is modified, fdec performance drops substantially again.
Presuming the modifications to fnew crossed some threshold, I increased the following gcc parameter: --param max-inline-insns-auto=60 (by default, its value is supposed to be 40). And it worked: performance of fdec is back to normal.
But I guess this game will continue forever, with each little modification of fnew (or anything similar) requiring a further tweak.
This is just plain weird. There is no logical reason for some little modification in function fnew to have a knock-on effect on the completely unrelated function fdec, whose only relation to it is being in the same file.
The only tentative explanation I could come up with so far is that maybe the simple presence of fnew is enough to cross some kind of global file threshold which then impacts fdec. fnew can be made "not present" when it is: 1. not there at all, 2. static but not called from anywhere, or 3. static and small enough to be inlined. But that just hides the problem. Does it mean I can't add any new function?
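A minimal sketch of case 2 above (the function names here are made up for illustration):

```c
/* Case 2: static and never called. At -O1 and above, gcc's dead code
   elimination drops this function from the object file entirely, so it
   cannot shift the position of any other code. */
static int unused_helper(int x) { return x * 2 + 1; }

/* A non-static function with the same body must be kept, since other
   translation units could call it. Its code stays in the file and shifts
   everything emitted after it. */
int exported_helper(int x) { return x * 2 + 1; }
```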
Really, I couldn't find any satisfying explanation anywhere on the net. I was curious to know whether someone has already experienced an equivalent side-effect, and found a solution for it.
[Edit]
Let's go for an even crazier test. Now I'm adding another, completely useless function, just to play with. Its content is strictly a copy-paste of fnew, but the function's name is obviously different, so let's call it wtf.
When wtf exists, it doesn't matter whether fnew is static or not, nor what the value of max-inline-insns-auto is: performance of fdec is back to normal. Even though wtf is not used nor called from anywhere... :'(
[Edit 2]
There is no inline keyword anywhere; all functions are either normal or static. The inlining decision is solely within the compiler's realm, which has worked fine so far.
[Edit 3]
As suggested by Peter Cordes, the issue is not related to inlining, but to instruction alignment. On newer Intel CPUs (Sandy Bridge and later), hot loops benefit from being aligned on 32-byte boundaries. The problem is that, by default, gcc aligns them on 16-byte boundaries, which gives only a 50% chance of landing on the better alignment, depending on the length of the preceding code. Hence a hard-to-understand issue which "looks random".
Not all loops are sensitive. It only matters for critical loops, and only if their length makes them cross one more 32-byte instruction-fetch block when less ideally aligned.
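A possible mitigation, sketched below as a build-flag fragment (whether it actually helps depends on the exact CPU and gcc version), is to ask gcc explicitly for 32-byte loop alignment instead of leaving it at the 16-byte default:

```make
# Build-flag sketch: request 32-byte loop alignment so that hot loops land on
# a 32-byte boundary regardless of how much code precedes them.
# -falign-loops is a standard gcc flag; the cost is a little padding.
CFLAGS += -O2 -falign-loops=32
```

This removes the dependence on the length of unrelated preceding code, which is what made every edit to fnew look like it was perturbing fdec.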