STL priority_queue compiled with GCC 9 has slower performance compared to GCC 5

Question

For my project I switched from GCC 5 to GCC 9 and found that performance got worse. After some investigation I came up with a short program that reproduces the behaviour.

I compile the code with different GCC versions (g++-5 and g++-9) on the same machine:

#include <queue>

int main()
{
        std::priority_queue<int> q;
        for (int j = 0; j < 2000; j ++) {
                for (int i = 0; i < 20000; i ++) {
                        q.emplace(i);
                }
                for (int i = 0; i < 20000; i ++) {
                        q.pop();
                }
        }
        return 0;
}

When I compile it using GCC 5 I get the following timings:

# g++-5 -std=c++14 -O3 main.cpp
# time ./a.out

real    0m1.580s
user    0m1.578s
sys     0m0.001s

Doing the same with GCC 9 I get:

# g++-9 -std=c++14 -O3 main.cpp
# time ./a.out

real    0m2.292s
user    0m2.288s
sys     0m0.003s

As you can see, GCC 9 gives slower results: about 45% more wall-clock time (2.292 s versus 1.580 s).

I am not sure that the issue is in the STL priority_queue itself. I tried the boost priority_queue and got the same results.

Does anyone have a clue why this program is slower with GCC 9 than with GCC 5? Should I be using particular compiler flags? Thank you in advance!
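
To see whether the extra time goes into filling or draining the queue, the two phases of the reproducer can be timed separately. The following is only a sketch: `PhaseTimes` and `run_rounds` are hypothetical names, not part of the original program, and the added `q.top()` read (used to build a checksum so the work stays observable) makes the workload slightly different from the one measured above.

```cpp
#include <chrono>
#include <cstdint>
#include <queue>

// Hypothetical helper (not from the question): accumulated time per phase.
struct PhaseTimes {
    std::chrono::nanoseconds fill{0};   // time spent in emplace()
    std::chrono::nanoseconds drain{0};  // time spent in top() + pop()
    std::uint64_t checksum = 0;         // sum of all popped values
};

// Same loop structure as the reproducer, with the sizes as parameters.
PhaseTimes run_rounds(int rounds, int n)
{
    using clock = std::chrono::steady_clock;
    using std::chrono::duration_cast;
    using std::chrono::nanoseconds;

    std::priority_queue<int> q;
    PhaseTimes t;
    for (int j = 0; j < rounds; ++j) {
        const auto t0 = clock::now();
        for (int i = 0; i < n; ++i) {
            q.emplace(i);
        }
        const auto t1 = clock::now();
        for (int i = 0; i < n; ++i) {
            t.checksum += static_cast<std::uint64_t>(q.top());
            q.pop();
        }
        const auto t2 = clock::now();
        t.fill += duration_cast<nanoseconds>(t1 - t0);
        t.drain += duration_cast<nanoseconds>(t2 - t1);
    }
    return t;
}
```

Calling `run_rounds(2000, 20000)` from `main` and printing `fill.count()` and `drain.count()` gives the question's workload split by phase; comparing those two numbers between compilers should show which half regressed.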

It would be useful if you could do some manual binary search to narrow it down to the precise version of gcc that introduced the performance regression. GCC 5 to 9 is a pretty big jump of over half a decade. — Thomas, Oct 07 '22 at 12:17
Please also update your question with the exact version numbers (`g++ --version`). — Thomas, Oct 07 '22 at 12:20
Looking at the assembler, I notice that GCC-9 does not inline a call to ```std::__adjust_heap``` whereas GCC-5 does not inline ```std::vector::_M_emplace_back_aux```. Why they chose to do that with a single call-site in both cases is beyond me but I guess it could just be a tweak in the tuning options — Homer512, Oct 07 '22 at 13:57
What CPU do you have? If it's a Skylake, does [How can I mitigate the impact of the Intel jcc erratum on gcc?](https://stackoverflow.com/q/61256646) help? If so, it might just be random chance that GCC5 was fast and GCC9 was slow, separate from any missed-optimizations like poor inlining decisions. — Peter Cordes, Oct 07 '22 at 16:07
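
Since the same slowdown shows up with Boost, it may help to take the container adaptor out of the picture. `std::priority_queue` is specified in terms of the heap algorithms: `push` is `push_back` followed by `std::push_heap`, and `pop` is `std::pop_heap` followed by `pop_back`. A minimal sketch of that equivalence is below; `ManualMaxHeap` is a hypothetical name used for illustration. As far as I can tell from the libstdc++ headers, the sift-down inside `std::pop_heap` is where `std::__adjust_heap` is called, so this isolates the code path mentioned in the comment above.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for std::priority_queue<int>, for experiments only.
class ManualMaxHeap {
public:
    void push(int value)
    {
        data_.push_back(value);                      // may reallocate the vector
        std::push_heap(data_.begin(), data_.end());  // sift the new element up
    }

    void pop()
    {
        std::pop_heap(data_.begin(), data_.end());   // move max to the back, sift down
        data_.pop_back();
    }

    int top() const { return data_.front(); }
    bool empty() const { return data_.empty(); }

private:
    std::vector<int> data_;
};
```

Swapping this class into the benchmark loop and comparing the generated assembly across compilers is one way to check whether the difference lies in the heap algorithms or in how the adaptor's calls get inlined.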

score 6 · Answer 1 · answered Oct 07 '22 at 13:05

This is not meant to be an answer but since I have a few g++ toolchains available I made a few test runs to see if I could see something interesting regarding this perceived degradation.

The biggest slowdown seems to be between 6.2 and 7.2. Perhaps this table can trigger someone to recall what may be the cause.

I used C++11 since I started with gcc 4, so in all cases except the first one, I used g++ -std=c++11 -O3 main.cpp.

| g++ version | real | user | sys |
|---|---|---|---|
| 4.5.0 (`-std=c++0x`) | 0m1.711s | 0m1.701s | 0m0.004s |
| 4.8.5 | 0m1.673s | 0m1.667s | 0m0.002s |
| 5.1.0 | 0m1.586s | 0m1.578s | 0m0.002s |
| 6.2.0 | 0m1.775s | 0m1.766s | 0m0.003s |
| 7.2.0 | 0m2.192s | 0m2.176s | 0m0.003s |
| 8.2.0 | 0m2.192s | 0m2.186s | 0m0.000s |
| 9.3.0 | 0m2.122s | 0m2.114s | 0m0.001s |
| 10.2.0 | 0m2.308s | 0m2.299s | 0m0.002s |
| 11.3.0 | 0m2.293s | 0m2.285s | 0m0.002s |
| 12.1.0 | 0m2.306s | 0m2.299s | 0m0.001s |
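
To put the table in relative terms, the user times can be divided by a baseline. The sketch below uses the 5.1.0 row as the baseline because it is the fastest; `Sample`, `slowdown`, and `print_slowdowns` are hypothetical names, and the numbers are copied from the user column above. By this measure 7.2.0 takes roughly 23% more user time than 6.2.0, and 12.1.0 roughly 46% more than 5.1.0.

```cpp
#include <cstdio>

// One row of the table: compiler version and its "user" time in seconds.
struct Sample {
    const char* version;
    double user_seconds;
};

// User times copied from the table above.
const Sample kSamples[] = {
    {"4.5.0", 1.701}, {"4.8.5", 1.667}, {"5.1.0", 1.578}, {"6.2.0", 1.766},
    {"7.2.0", 2.176}, {"8.2.0", 2.186}, {"9.3.0", 2.114}, {"10.2.0", 2.299},
    {"11.3.0", 2.285}, {"12.1.0", 2.299},
};

// Ratio of a measurement to a baseline; values above 1.0 mean slower.
double slowdown(double seconds, double baseline_seconds)
{
    return seconds / baseline_seconds;
}

// Print every version's user time relative to the 5.1.0 row.
void print_slowdowns()
{
    const double baseline = 1.578;  // 5.1.0, the fastest row
    for (const Sample& s : kSamples) {
        std::printf("%-7s %.2fx\n", s.version,
                    slowdown(s.user_seconds, baseline));
    }
}
```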

Ted Lyngmo


Could you try setting a specific `-march` option? I believe the default tuning changed. Maybe pick something that should be present in all versions, like `-march=nehalem`. – Homer512 Oct 07 '22 at 13:47
@Homer512 I tried `-march=nehalem` with a few toolchain versions (those with the biggest diffs) but the results were pretty consistent. Perhaps I should mention the CPU? It's reported as an Intel(R) Xeon(R) Gold 6242R CPU @ 3.10GHz – Ted Lyngmo Oct 07 '22 at 13:53
BTW, in case you were considering `-march=native`, that won't work well. A GCC too old to know about `-march=skylake-avx512` will still enable the ISA extension options it knows about, but you won't get a `-mtune=something-recent`; it just gives up and uses `-mtune=generic` if it is too old to know about your CPU specifically. So `-march=nehalem` to imply `-mtune=nehalem` is a reasonable choice. – Peter Cordes Oct 07 '22 at 15:55
Of course your CPU *isn't* a Nehalem... It is a Skylake, where microcode updates have introduced a few performance pot-holes. One that needs compilers to work around it, if a tight loop happens to step in it: [How can I mitigate the impact of the Intel jcc erratum on gcc?](https://stackoverflow.com/q/61256646) – Peter Cordes Oct 07 '22 at 15:57
@PeterCordes Re: _"in case you were considering"_ - guilty. I tried. Could we set something up that can give us some insight? I'm willing to re-test properly. – Ted Lyngmo Oct 07 '22 at 23:06
`-march=nehalem` is *probably* fine. `-mtune=sandybridge` or `-mtune=corei7-avx` might work, at least for the GCCs new enough to know them. Also use `-Wa,-mbranches-within-32B-boundaries` to mitigate the problem caused by the microcode workaround for the JCC erratum; that's always a prime suspect for micro-benchmarks on SKL/SKX, esp. if front-end throughput is a problem. But really the best bet is to figure out what asm (or machine-code alignment) difference was causing the big change, and then work from there to see which GCC options or versions help or not with it. – Peter Cordes Oct 08 '22 at 04:06
@PeterCordes _"the best bet is to figure out what asm (or machine-code alignment) difference was causing the big change"_ - I will try to build as good a matrix as I can when I'm back at the store. For _our_ particular needs I think we're not going to change just now, but it's always nice to keep an eye out for options. I'm also not able to tell "what's what" in assembly. – Ted Lyngmo Oct 08 '22 at 04:37
Oh, if you mean for production use, `-march=native` *with a recent GCC version* is supposed to be good; that's what `-march=native` is intended for. The reason not to use it for this test is that we want to try ancient GCC versions quite a bit older than your CPU, which will fall back to `-mtune=generic` if they don't support `-march=skylake-avx512`. I would actually strongly recommend *against* `-mtune=sandybridge` for general use on a Skylake in cases that include auto-vectorization. ([Why doesn't gcc resolve \_mm256\_loadu\_pd as single vmovupd?](https://stackoverflow.com/q/52626726)) – Peter Cordes Oct 08 '22 at 04:41
@PeterCordes Ok, let's see if I can keep up. I did at first just do `g++ -std=c++11 -O3`. Then I tried `-march=nehalem` on select versions. I did try `-march=native` too even though I didn't mention it. I didn't actually see any diff worth mentioning. What kind of matrix is worth building here? I am absolutely not the guy who decides, but I can try things out given instructions. – Ted Lyngmo Oct 08 '22 at 04:45
