What is the /d2vzeroupper MSVC compiler optimization flag doing?

Question

I was reading through this Compiler Options Quick Reference Guide for Epyc CPUs from AMD: https://developer.amd.com/wordpress/media/2020/04/Compiler%20Options%20Quick%20Ref%20Guide%20for%20AMD%20EPYC%207xx2%20Series%20Processors.pdf

For MSVC, to "Optimize for 64-bit AMD processors", they recommend enabling /favor:AMD64 /d2vzeroupper.

What /favor:AMD64 does is clear; it is documented in the MSVC docs. But I can't find /d2vzeroupper mentioned anywhere on the internet, and there seems to be no documentation for it. What does it do?

There appears to be some info in [this question](https://stackoverflow.com/questions/41303780/why-is-this-sse-code-6-times-slower-without-vzeroupper-on-skylake). — 500 - Internal Server Error, Sep 24 '21 at 17:53

Alex Guteniev · Accepted Answer · 2022-05-16T20:11:20.843

TL;DR: When using /favor:AMD64 add /d2vzeroupper to avoid very poor performance of SSE code on both current AMD CPUs and Intel CPUs.

Generally, /d1... and /d2... are "secret" (undocumented) MSVC options that tune compiler behavior. /d1... options apply to the compiler front-end; /d2... options apply to the compiler back-end.

/d2vzeroupper enables compiler-generated vzeroupper instructions.

See "Do I need to use _mm256_zeroupper in 2021?" for more information.

Normally it is enabled by default. You can disable it with /d2vzeroupper-. See here: https://godbolt.org/z/P48crzTrb

The /favor:AMD64 switch suppresses vzeroupper, so adding /d2vzeroupper re-enables it.

Up-to-date Visual Studio 2022 has fixed this: /favor:AMD64 still emits vzeroupper, so /d2vzeroupper is no longer needed to enable it.
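As a rough illustration (not taken from the Godbolt link above), a function like the hypothetical `sum8` below uses a 256-bit AVX load and then returns to a caller that may run SSE code. With a default MSVC build using /arch:AVX, the compiler is expected to place a vzeroupper before the `ret`. The function name, the `AVX_FN` macro, and the build lines in the comments are assumptions made for this sketch.

```cpp
// Hypothetical example; the build lines are assumptions for illustration:
//   cl /O2 /arch:AVX /FAs sum8.cpp                 (vzeroupper expected before ret)
//   cl /O2 /arch:AVX /d2vzeroupper- /FAs sum8.cpp  (compiler-generated vzeroupper disabled)
#include <immintrin.h>

#if defined(__GNUC__) || defined(__clang__)
#define AVX_FN __attribute__((target("avx")))  // lets GCC/Clang build this without -mavx
#else
#define AVX_FN  // MSVC accepts AVX intrinsics without a per-function attribute
#endif

// Sums eight floats. The 256-bit load leaves the upper YMM halves non-zero
// until something (normally a compiler-generated vzeroupper) clears them.
AVX_FN float sum8(const float* p) {
    __m256 v  = _mm256_loadu_ps(p);               // 256-bit AVX load
    __m128 lo = _mm256_castps256_ps128(v);        // lower four floats
    __m128 hi = _mm256_extractf128_ps(v, 1);      // upper four floats
    __m128 s  = _mm_add_ps(lo, hi);               // four partial sums
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));       // two partial sums
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));   // total in lane 0
    return _mm_cvtss_f32(s);
}
```

Comparing the two assembly listings produced by /FAs should show whether the vzeroupper is present at the end of `sum8`.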

Reason: current AMD optimization guides (available from the AMD site; direct PDF link) state:

2.11.6 Mixing AVX and SSE

There is a significant penalty for mixing SSE and AVX instructions when the upper 128 bits of the YMM registers contain non-zero data. Transitioning in either direction will cause a micro-fault to spill or fill the upper 128 bits of all 16 YMM registers. There will be an approximately 100 cycle penalty to signal and handle this fault. To avoid this penalty, a VZEROUPPER or VZEROALL instruction should be used to clear the upper 128 bits of all YMM registers when transitioning from AVX code to SSE or unknown code
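The transition point described in the quote can be made explicit with the `_mm256_zeroupper()` intrinsic. The sketch below is a minimal, hypothetical helper (`scale_avx` and the `AVX_FN` macro are names invented for this example); compilers typically insert the instruction automatically, so the explicit call is usually redundant and is shown only to mark where the upper halves get cleared.

```cpp
#include <immintrin.h>
#include <cstddef>

#if defined(__GNUC__) || defined(__clang__)
#define AVX_FN __attribute__((target("avx")))  // lets GCC/Clang build this without -mavx
#else
#define AVX_FN  // MSVC accepts AVX intrinsics without a per-function attribute
#endif

// Hypothetical helper: multiplies n floats in place by factor.
AVX_FN void scale_avx(float* data, std::size_t n, float factor) {
    const __m256 f = _mm256_set1_ps(factor);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {                  // 256-bit main loop
        __m256 v = _mm256_loadu_ps(data + i);
        _mm256_storeu_ps(data + i, _mm256_mul_ps(v, f));
    }
    for (; i < n; ++i) {                          // scalar tail
        data[i] *= factor;
    }
    // Clear the upper 128 bits of all YMM registers before returning to
    // SSE or unknown code, as the quoted guide recommends.
    _mm256_zeroupper();
}
```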

Older AMD processors did not need vzeroupper, so /favor:AMD64 implemented an optimization for them by omitting it, even though that penalizes Intel CPUs. From the MS docs:

/favor:AMD64

(x64 only) optimizes the generated code for the AMD Opteron, and Athlon processors that support 64-bit extensions. The optimized code can run on all x64 compatible platforms. Code that is generated by using /favor:AMD64 might cause worse performance on Intel processors that support Intel64.

edited May 16 '22 at 20:11

answered Sep 24 '21 at 17:57

Alex Guteniev


Interesting, thanks! Do you know how it makes sense that AMD recommends to manually specify /d2vzeroupper on MSVC if it's enabled by default anyways? That seems a bit weird. Are you sure that not adding /d2vzeroupper has the exact same effect as adding /d2vzeroupper, on all code? – JohnAl Sep 24 '21 at 18:05

@JohnAl I've experimented a bit more and edited my answer. Now it should make sense – Alex Guteniev Sep 24 '21 at 18:10

@JohnAl: `/favor:AMD64` is an overall tuning option (like `gcc -mtune=znver1`) that presumably has many more effects, like setting values for inlining and loop unrolling decision heuristics. AMD CPUs (up to and including at least Zen 1; IDK how Zen2 handles things now that it has full 256-bit wide SIMD register entries and execution units) don't have SSE/AVX transition penalties and thus don't need vzeroupper for performance reasons. But without `vzeroupper`, your code could run *very* poorly on Intel CPUs so AMD might be suggesting that to make binaries that are generally usable. – Peter Cordes Sep 24 '21 at 18:31

@JohnAl: Or maybe that's a sign that current AMD CPUs *do* need `vzeroupper` to avoid SSE/AVX transition penalties, like Intel CPUs with AVX. – Peter Cordes Sep 24 '21 at 18:32

@PeterCordes, the latter is true. I've edited my answer to quote _Software Optimization Guide for AMD EPYC™ 7003 Processors_ – Alex Guteniev Sep 25 '21 at 11:02

@JohnAl as it is now clear that the recommendation is right for current AMD CPUs, I've created a Developer Community issue asking that `/favor:AMD64` no longer skip `vzeroupper`: https://developercommunity.visualstudio.com/t/favor:AMD64-should-emit-vzeroupper-for/1539224 – Alex Guteniev Sep 25 '21 at 11:29
