
I've got a C/C++ project that uses a static library. The library is built for 'skylake' architecture. The project is a data processing module, i.e. it performs many arithmetic operations, memory copying, searching, comparing, etc.

The CPU is Xeon Gold 6130T, it supports AVX512. I tried to compile my project with both -march=skylake and -march=skylake-avx512 and then link with the library.

In case of using -march=skylake-avx512 the project performance is significantly decreased (by 30% on average) in comparison to the project built with -march=skylake.

How can this be explained? What could be the reason?

Info:

  • Linux 3.10
  • gcc 9.2
  • Intel Xeon Gold 6130T
Peter Cordes
Rom098
  • Did you check your CPU clock speed when running the app built for AVX-512 vs AVX2? – Anty Aug 19 '20 at 09:54
  • @Anty Not yet. How can I do it on Linux? (The Linux server is a remote server, I access it by ssh.) – Rom098 Aug 19 '20 at 09:59
  • `-march=skylake-avx512` on GCC9.2 defaults to `-mprefer-vector-width=256` (https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html). Are you doing anything to override that? IDK if `#pragma omp simd` would maybe still use 512-bit. What exact compile options are you using? If you do rule out simple clock-speed differences (with `perf stat`), can you find one or two specific loops that slow down and show the source and asm differences? – Peter Cordes Aug 19 '20 at 20:42
  • @PeterCordes Here are the options I use "`-std=c++17 -fno-rtti -fmax-errors=5 -fno-exceptions -O3 -DNDEBUG -march=skylake -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE`". I've removed all `-Wnnn` options from the line. – Rom098 Aug 20 '20 at 07:44
  • Ok, so we don't expect any 512-bit vectorization. So either it's silly anti-optimization due to shooting itself in the foot with AVX-512VL, or different tune settings from `skylake-avx512` (Possibly because of larger L2 cache, or something?) Or some other difference. Still a good idea to use `perf stat` to check for clock speed differences in case a 512-bit instruction crept in. ([SIMD instructions lowering CPU frequency](https://stackoverflow.com/q/56852812) and Maxim's answer) – Peter Cordes Aug 20 '20 at 07:58
  • Or maybe a stray 512-bit instruction in libc making everything else slow, like [Dynamically determining where a rogue AVX-512 instruction is executing](https://stackoverflow.com/q/52008788)? But that only slows down SSE code by creating false dependencies, I think, and should be fixed by a `vzeroupper` at some point in something compiled with `-march=skylake`. – Peter Cordes Aug 20 '20 at 07:59
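The comments above suggest two concrete checks that work fine over ssh; a sketch of both (`./your_app` is a placeholder for the actual binary):

```shell
# 1. Watch the actual core clocks while the workload runs; the "cpu MHz"
#    fields in /proc/cpuinfo reflect the current per-core frequency.
grep "cpu MHz" /proc/cpuinfo | sort -t: -k2 -n | tail -4

# 2. Scan the binary (including any statically linked library code) for
#    stray 512-bit instructions: anything touching a zmm register counts.
objdump -d ./your_app | grep 'zmm' || echo "no zmm instructions found"
```

If `perf` is available for the 3.10 kernel, `perf stat ./your_app` also reports the average clock directly (cycles divided by task-clock), which is the most reliable way to compare the two builds.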

1 Answer


project performance is significantly decreased (by 30% on average)

In code that cannot be easily vectorized, sporadic AVX-512 instructions here and there downclock your CPU but provide no benefit. You may want to turn AVX-512 instructions off completely in such scenarios.

See Advanced Vector Extensions, Downclocking:

Since AVX instructions are wider and generate more heat, Intel processors have provisions to reduce the Turbo Boost frequency limit when such instructions are being executed. The throttling is divided into three levels:

  • L0 (100%): The normal turbo boost limit.
  • L1 (~85%): The "AVX boost" limit. Soft-triggered by 256-bit "heavy" (floating-point unit: FP math and integer multiplication) instructions. Hard-triggered by "light" (all other) 512-bit instructions.
  • L2 (~60%): The "AVX-512 boost" limit. Soft-triggered by 512-bit heavy instructions. The frequency transition can be soft or hard. Hard transition means the frequency is reduced as soon as such an instruction is spotted; soft transition means that the frequency is reduced only after reaching a threshold number of matching instructions. The limit is per-thread.

Downclocking means that using AVX in a mixed workload with an Intel processor can incur a frequency penalty despite it being faster in a "pure" context. Avoiding the use of wide and heavy instructions helps minimize the impact in these cases. AVX-512VL is an example of using only 256-bit operands in AVX-512, making it a sensible default for mixed loads.
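A minimal way to see which vector width GCC actually picks is to compile a trivially vectorizable loop with `-S` and look for `ymm` vs `zmm` registers in the assembly; the file and function names below are made up for illustration:

```c
/* sum.c -- a trivially vectorizable integer reduction.
 *
 * gcc -O3 -march=skylake-avx512 -S sum.c
 *   should vectorize with 256-bit ymm registers (the GCC default),
 * gcc -O3 -march=skylake-avx512 -mprefer-vector-width=512 -S sum.c
 *   should switch the same loop to 512-bit zmm registers.
 */
int sum(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}
```

Inspecting `sum.s` after each build makes it easy to confirm whether 512-bit instructions made it into a hot loop.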


Maxim Egorushkin
    `-march=skylake-avx512` on GCC9.2 defaults to `-mprefer-vector-width=256` for these reasons. The availability of AVX512 is more likely leading GCC into other missed optimizations, like perhaps larger code size by using `vmovdqu64` instead of VEX-coded `vmovdqu`, or using compare-into-mask instead of AVX2 compare into vector when that would actually be better. Unless maybe OpenMP vectorization uses 512-bit? – Peter Cordes Aug 19 '20 at 20:38
  • @PeterCordes I often encounter gcc generating rather sub-optimal code with AVX, version 10 has some fixes, like https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91722 – Maxim Egorushkin Aug 19 '20 at 20:52
  • 1
    The OP is using GCC9.2 so we're trying to explain the results with that compiler. That missed optimization of over-aligning the stack without using it happens even with `-march=skylake`. But yes, I'd agree with your recommendation to upgrade to GCC10 for AVX / AVX512 code-gen in general. The newer the compiler the better when dealing relatively new instruction sets. – Peter Cordes Aug 19 '20 at 20:57