
After starting to use gcc 11 on Ubuntu 22.04, I noticed a ~90% degradation in my C application's performance - the way I measure it.
Narrowing it down, I saw that the degradation starts with gcc 8.4.0-3ubuntu2.
I'm now on Ubuntu 22.04 using gcc-7 and gcc-8 (and gcc, which is gcc 11).
Compiling the exact same code with gcc-7 gives good results, while compiling with gcc-8 (or gcc 11) results in a slower application.

I did not find anything that looks relevant in the GCC 8 list of changes.
I don't have a simple reproducer; if I had one, it would mean I already knew the source of this issue.

Any suggestions?
Did anything change between gcc 7.5 and gcc 8.4 that could explain this?


**Edit** - after running gprof on the old-fast build (gcc-7) and the new-slow build (gcc-8), I think the most telling thing I see is that in the new-slow version this entry appears in second place of the flat profile:

```
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 39.27      9.83     9.83   173488     0.00     0.00  main_function
 22.89     15.56     5.73                             ...
 ...
```
  • usually, the code generated by compilers tends to get faster the more modern your compiler is. So, it's extremely important what you say is "the way I measure it": What are you measuring here? By the way, GCC 8.4 is really old, and GCC 7.5 is paleontologically ancient, so it might actually be that there's something different at work here, like some code in some library you use actively doing something different. – Marcus Müller Sep 04 '22 at 12:17
  • *How* do you measure your performance? Do you remember to measure an optimized build? And what is your program doing? Is it possible to create a [mre] to show us? Have you tried to *profile* the program, to find out where the problems might be? – Some programmer dude Sep 04 '22 at 12:18
  • @MarcusMüller - I compile the exact same code with the same compilation directives. I hope it means this is the same code. – hudac Sep 04 '22 at 12:21
  • not what I meant. Your code uses some libraries. They might have been updated *a lot* since GCC 7.5. They might simply not be doing the same thing. You will still need to explain to us what you're actually measuring; otherwise this question really is missing the most important info. – Marcus Müller Sep 04 '22 at 12:23
  • @Someprogrammerdude - my application reads packets from the network. Using the slow application I read ~0.05X bps. Using the fast one I read X bps. I compile both with `-O3`. I tried playing a bit with `perf` but couldn't find anything special... – hudac Sep 04 '22 at 12:24
  • reading packets from some network stack is very likely I/O-bound. Did `perf` say you spend most time in your code, or actually in libc or the kernel for the network functionality? Because if both perf runs (old and new) were the same, then your program would take the same time. – Marcus Müller Sep 04 '22 at 12:26
  • I'm using DPDK, so I skip the network stack. Using `perf` after compiling with `-O0 -g` I see mostly my functions, but there's nothing specific I can point at – hudac Sep 04 '22 at 12:35
  • @hudac Profiling a `-O0` build is pretty pointless. Profile the optimized builds with both compilers and then look for differences. The accumulated call graph view of `perf` might be helpful for that. – user17732522 Sep 04 '22 at 12:37
  • Build with the `-pg` option to create a file containing profiling information, and use `gprof` to analyze it. – Some programmer dude Sep 04 '22 at 12:39
  • @hudac hm, a `-O0` build doesn't yield you any information about performance regressions at all. That's the flag that says "don't make my code be efficient, just make it correspond as directly as possible to the C I wrote". What happens in a `-O2` or `-O3` build? – Marcus Müller Sep 04 '22 at 12:42
  • I wasn't sure `-g` could go with `-O3` - I remembered the symbols might get mixed up due to `-O3` optimization. I'll now try `-O3 -pg` and `gprof`/`perf` – hudac Sep 04 '22 at 12:47
  • I found out something regarding `rte_atomic32_cmpset()`, please see edited question – hudac Sep 04 '22 at 14:25
  • `-O3 -g` is totally fine; the debug info is separate metadata, and the optimizer doesn't avoid things that will make debugging meaningless. (If you want consistent debugging, that's what `-O0 -g` is for). `-O3 -pg` adds instrumentation overhead to every call; don't use `perf` with `-pg`, only `gprof`. And keep in mind that `gprof` is profiling a version of your program with some extra overhead. – Peter Cordes Sep 05 '22 at 04:24
  • IDK why you'd roll your own `__sync_bool_compare_and_swap` or `__atomic_compare_exchange_n` (https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html). Perhaps so you can omit the `lock` prefix in a single-threaded build? But since you are rolling your own, I guess that function looks ok. It doesn't give you the value on failure (so it's like `__sync_bool_compare_and_swap`, not the `val` variant), but some use-cases don't need that. Most code should just use C11 `atomic_compare_exchange_weak` (or sometimes `_strong`, although the choice doesn't matter on x86); see the sketch after these comments. – Peter Cordes Sep 05 '22 at 04:32
  • Anyway, in case I wasn't clear before, I wonder if `-pg` is choosing not to inline your CAS function, so it shows up high in `gprof` profiles when your program spends time spinning on CAS, or just waiting on cache misses. vs. before this cost was distributed where it inlined. – Peter Cordes Sep 05 '22 at 04:33
  • Nothing changed in GCC8 that would make inline asm slower, or these inline asm constraints slower. This implementation is already inefficient since it uses `sete` inside that asm template instead of using a GCC6 `"=@cce"` condition-code output operand. ([Using condition flags as GNU C inline asm outputs](https://stackoverflow.com/q/30314907)). But that might cost an extra cycle sometimes, not enough to make it go from negligible to major. – Peter Cordes Sep 05 '22 at 04:36
  • @PeterCordes actually this is not my implementation, it belongs to DPDK. I don't even think I call it directly - it's called through one of their other functions. I'll try again using `perf` and `-O3 -g` (no `-pg`) and see if it is still there. – hudac Sep 05 '22 at 06:50
  • A quick reminder: `-O3` is not stable optimization, it can actually be _slower_ than `-O2` which is the stable optimization flag. – Mgetz Sep 07 '22 at 13:31
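
As a side note to the CAS discussion in the comments above, here is a minimal sketch of the portable C11 alternative Peter Cordes mentions. This is not the DPDK implementation; `counter` and `bump_counter()` are made-up names for illustration only:

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t counter;  /* illustrative shared variable */

/* A CAS retry loop written with C11 atomics instead of hand-rolled
 * inline asm. On failure, atomic_compare_exchange_weak reloads
 * `expected` with the currently stored value, so the loop simply
 * retries with fresh data. */
static void bump_counter(void)
{
    uint32_t expected = atomic_load(&counter);
    while (!atomic_compare_exchange_weak(&counter, &expected, expected + 1)) {
        /* `expected` now holds the freshly observed value; retry */
    }
}
```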

1 Answer


OK then, this was the case:
for some reason gcc-7 did not care about it, but since gcc-8 it became an issue.

As you can see below, I had a big array instantiated on the stack of `main_function()`, with `sizeof(my_big_struct)` being 100.

Pseudo-code:

```c
void main_function(void) {
    my_big_struct bigstruct_arr[20000];  /* 20000 * 100 bytes = ~2 MB on the stack */
    /* ... */
}
```
  • With gcc-7 it ran without any problems.
  • With gcc-8 (and 11) it ran as well, but really slowly. I'm not sure why - too much time for the allocation? Or for array access?

As you can see from `perf` below, it says exactly that `main_function()` is the problematic one.
It is a bit misleading, because an address, 0x5594faaa3090, takes all the blame.
I did not understand what this address meant, until I realized that it is that array, `bigstruct_arr`.

```
Samples: 36K of event 'cycles', Event count (approx.): 13441606624
  Children      Self  Command          Shared Object       Symbol
-   90.47%    88.88%  trd_1            my_process          [.] main_function
   + 88.26% 0x5594faaa3090
   + 1.60% main_function
     0.61% 0
```

The solution was, of course, to define it as a global or to allocate it with `malloc()`.
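
For reference, a minimal sketch of the `malloc()` variant (the typedef is a stand-in I made up to match `sizeof(my_big_struct) == 100`; the real struct comes from the application):

```c
#include <stdlib.h>

/* Stand-in type: the real my_big_struct is application-specific,
 * but its size is 100 bytes, as stated above. */
typedef struct { char payload[100]; } my_big_struct;

void main_function(void) {
    /* ~2 MB moved from the stack to the heap */
    my_big_struct *bigstruct_arr = malloc(20000 * sizeof *bigstruct_arr);
    if (bigstruct_arr == NULL)
        return;  /* allocation failed */

    /* ... same work on bigstruct_arr as before ... */

    free(bigstruct_arr);
}
```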

  • Default stack size on Linux is 8 MB. Your array is 2 MB if I compute correctly, and presumably the rest of your program needs less than 6 MB, so no overflow. – Nate Eldredge Sep 07 '22 at 13:39
  • @NateEldredge where do you see this? I see, on my specific server: `ulimit -a ... stack size (kbytes, -s) 8192` – hudac Sep 07 '22 at 13:54
  • So we agree, right? 8192 KB = 8 MB. – Nate Eldredge Sep 07 '22 at 14:31
  • Oh damn. Well, then my only question is why the large stack allocation took so long. Or searching in it. – hudac Sep 07 '22 at 18:15
  • Are you sure your code was really like your example, and not like `my_big_struct bigstruct_arr[20000] = { ... };`? In the latter case, the compiler has to fully initialize it on every entry to the block; any entries that you didn't explicitly initialize are zeroed out. Whereas if you write the same code at file scope, the initialization is only done once, and with malloc, it's only done when you explicitly `memset` or whatever. – Nate Eldredge Sep 08 '22 at 13:43
  • Otherwise, I couldn't guess unless you can post a complete, compilable example. You are right that stack allocation should take no time to speak of: simply subtract from the stack pointer, and the first time, fault in the necessary pages. And there is no reason why stack memory should be slower than static or heap memory; it's all just memory. Perhaps when you changed it to global or malloc, you subtly changed the semantics somehow (e.g. pointer aliasing?). And of course there is always the possibility of a compiler mis-optimization bug. – Nate Eldredge Sep 08 '22 at 13:49
  • @NateEldredge - yes, there was no initialization. mm, I don't know what happened :/ – hudac Sep 08 '22 at 23:37
  • @NateEldredge: With `gcc -fstack-check`, it will emit code that loops, touching every 4k page as it grows the stack. This is not required on Linux for correctness, but can be done as [hardening against stack clash attacks](https://stackoverflow.com/questions/60058873/linux-process-stack-overrun-by-local-variables-stack-guarding). On the first access, this will cost some page faults, but on later allocations (after this function returns and is called again) it should be pretty trivially fast. Especially with hugepages enabled so there's no TLB miss. – Peter Cordes Sep 10 '22 at 02:48
  • I do have hugepages. But this issue happens all the time, not just on first access. BTW I've noticed that if I reduce the size of the array, it doesn't happen. – hudac Sep 10 '22 at 19:42
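
To illustrate Nate Eldredge's point about initializers, here is a minimal hypothetical sketch (not the question's real code). With an initializer the compiler must zero the whole ~2 MB array on every call; without one, the definition costs only a stack-pointer adjustment:

```c
typedef struct { char payload[100]; } my_big_struct;  /* stand-in, sizeof == 100 */

void with_initializer(void) {
    /* The compiler emits code to zero all ~2 MB on every entry to the block. */
    my_big_struct bigstruct_arr[20000] = {0};
    bigstruct_arr[0].payload[0] = 1;  /* touch the array so it is used */
}

void without_initializer(void) {
    /* No initializer: allocating is just a stack-pointer adjustment. */
    my_big_struct bigstruct_arr[20000];
    bigstruct_arr[0].payload[0] = 1;
}
```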