
My goal is to develop code that compiles using SIMD instructions when they are available and doesn't when they are not. More specifically, in my C code I am making explicit SIMD intrinsic calls and checking at runtime whether or not those calls are valid based on processor info I am pulling.

I had a bunch of questions, but after enough typing SO pointed me to: Detecting SIMD instruction sets to be used with C++ Macros in Visual Studio 2015

The only remaining question is: how does the /arch flag impact explicit SIMD code? Does it still work even if /arch is not set? For example, can I write AVX2 calls without having /arch:AVX2?
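Something like the sketch below is the pattern I mean: call intrinsics explicitly, but only reach them on CPUs that report support. It assumes MSVC's `__cpuid`/`__cpuidex` from `<intrin.h>`; the kernel names and the work they do are invented for illustration, and a real AVX2 check would also verify OS support for the YMM state via OSXSAVE/`_xgetbv`.

```c
#include <stddef.h>
#include <stdint.h>
#include <intrin.h>      /* __cpuid / __cpuidex (MSVC) */
#include <immintrin.h>   /* SSE2 and AVX2 intrinsics */

/* Does CPUID leaf 7 report AVX2?  (A production check would also confirm
   OSXSAVE + _xgetbv so the OS actually saves the YMM state.) */
static int cpu_has_avx2(void)
{
    int info[4];
    __cpuid(info, 0);
    if (info[0] < 7)
        return 0;
    __cpuidex(info, 7, 0);
    return (info[1] & (1 << 5)) != 0;   /* EBX bit 5 = AVX2 */
}

/* Invented example kernels -- remainder handling omitted for brevity. */
static void add_i32_avx2(int32_t *dst, const int32_t *a, const int32_t *b, size_t n)
{
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(dst + i), _mm256_add_epi32(va, vb));
    }
}

static void add_i32_sse2(int32_t *dst, const int32_t *a, const int32_t *b, size_t n)
{
    for (size_t i = 0; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi32(va, vb));
    }
}

/* Dispatch: the AVX2 path is only executed on CPUs that report AVX2,
   even though the compiler emitted its instructions unconditionally. */
void add_i32(int32_t *dst, const int32_t *a, const int32_t *b, size_t n)
{
    if (cpu_has_avx2())
        add_i32_avx2(dst, a, b, n);
    else
        add_i32_sse2(dst, a, b, n);
}
```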

  • You can write SSE4 code without any /arch flags, but last I read you needed to enable `/arch:AVX` to avoid SSE-AVX transition penalties when using AVX / AVX2 intrinsics. Unlike GCC/Clang, MSVC doesn't stop you from using intrinsics for instructions you haven't told the compiler it's allowed to use itself. – Peter Cordes Jun 13 '18 at 01:01
  • @PeterCordes Is that because you're mixing compiler optimized/generated SSE (SSE2?) with explicit AVX? I would assume if the program is relatively sparse and consisted mostly of explicit SIMD calls that you wouldn't run into this problem because there wouldn't be much compiler generated SSE code? – Jimbo Jun 13 '18 at 01:08
  • Even in a manually-vectorized function, you'll often use some intrinsics for instructions that have a 128-bit non-VEX encoding. e.g. maybe you `_mm256_loadu_si256` and extract / pack down to a `__m128i`. This might have changed, but I think without `/arch:AVX` MSVC would emit `paddw` instead of `vpaddw` for `_mm_add_epi16` even in a function that also uses `_mm256_whatever`. – Peter Cordes Jun 13 '18 at 01:21
  • I just checked, and MSVC 2015 doesn't do that. https://godbolt.org/g/LQouiE shows that you get `vpaddd` for `_mm_add_epi32` in this simple case when the input comes from a 256-bit load (a sketch of that kind of test follows these comments). I'm pretty sure I read that was a concern with older MSVC, but maybe not. I don't use that compiler, other than to sometimes look at its asm output for SO answers. – Peter Cordes Jun 13 '18 at 02:10
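For reference, the kind of test being discussed might look something like the hypothetical function below: a 256-bit load feeding a 128-bit add, where the question is whether the compiler encodes `_mm_add_epi32` as `vpaddd` (VEX) or `paddd` (legacy SSE) when `/arch:AVX` isn't set.

```c
#include <immintrin.h>

/* Hypothetical repro of the 128/256-bit mix discussed above. */
__m128i mixed_add(const void *p, __m128i x)
{
    __m256i wide = _mm256_loadu_si256((const __m256i *)p); /* AVX: 256-bit unaligned load */
    __m128i lo   = _mm256_castsi256_si128(wide);           /* no instruction, just the low half */
    return _mm_add_epi32(lo, x);                           /* 128-bit add: VEX or legacy encoding? */
}
```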

1 Answer


There are a few pieces to the answer here.

First, classic 32-bit Intel x86 code used the x87 instruction set for floating-point, so the compiler generated x87 instructions for the float and double types. For a long time, this was the default behavior of the Visual C++ compiler when building for 32-bit. You can force its use in 32-bit code with /arch:IA32--this switch is not valid for 64-bit.

For AMD64 64-bit code (which was also adopted by Intel for 64-bit and is known generically as x64), the x87 instruction set was deprecated, along with 3DNow! and Intel MMX instructions, when running in 64-bit mode. All float and double code is instead generated using SSE/SSE2, although not necessarily using the full 4-float or 2-double width of the XMM registers. Instead, the compiler will usually generate scalar versions of the SSE/SSE2 instructions that use just the lowest element of an XMM register--and in fact the 64-bit __fastcall calling convention and .NET marshalling rules only deal with that scalar element as a result. This is the default behavior of the Visual C++ compiler when building for 64-bit. You can get the same codegen for 32-bit via the /arch:SSE or /arch:SSE2 switches; those switches aren't valid for x64 because SSE/SSE2 support is already a baseline requirement there.

Starting with Visual C++ 2015, /arch:SSE2 is the default for 32-bit code gen and is implicitly required for all 64-bit code gen.
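As a concrete illustration, an ordinary scalar function like the hypothetical one below contains no intrinsics at all, yet which instructions it compiles to is decided entirely by the /arch setting; the instruction names in the comments are typical output, not a guarantee.

```c
/* Plain scalar math; typical (not guaranteed) codegen:
   - 32-bit with /arch:IA32       : x87 stack instructions (fld / fmul / fadd / fstp)
   - x64 default, or /arch:SSE2   : scalar SSE2 in XMM registers (mulsd / addsd)
   - /arch:AVX or /arch:AVX2      : the same scalar operations, VEX-encoded (vmulsd / vaddsd) */
double axpy(double a, double x, double y)
{
    return a * x + y;
}
```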

This brings us to /arch:AVX. For both 32-bit and 64-bit codegen, this lets the compiler use the VEX prefix to encode the SSE/SSE2 instructions (whether generated by the scalar math codegen I talked about above or via explicit use of compiler intrinsics). This encoding uses 3 operands (dest, src1, src2) instead of the traditional 2-operand form (dest/src1, src2) used by the legacy Intel encodings. The net result is that all SSE/SSE2 code-gen makes more efficient use of the available registers. This is really the bulk of what using /arch:AVX gets you.
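A small, hypothetical intrinsic example makes the register benefit visible: when both inputs stay live past an operation, the destructive 2-operand legacy encoding typically costs an extra register-to-register copy, while the 3-operand VEX form writes straight to a fresh destination. The instruction sequences in the comments are typical, not guaranteed.

```c
#include <immintrin.h>

/* Both a and b are still needed after the add, so the sum needs its own register.
   Legacy SSE (no /arch:AVX): movaps xmm2, xmm0
                              addps  xmm2, xmm1        ; 2-operand, extra copy
   VEX (/arch:AVX):           vaddps xmm2, xmm0, xmm1  ; 3-operand, no copy */
__m128 add_and_mul(__m128 a, __m128 b, __m128 *sum_out)
{
    *sum_out = _mm_add_ps(a, b);
    return _mm_mul_ps(a, b);   /* a and b reused here */
}
```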

There are other aspects of the compiler that also make use of the /arch switch settings, such as the optimized memcpy and the instruction set that is available to the auto-vectorizer in /O2 and /Ox builds. The compiler also assumes that if you use /arch:AVX it is free to use SSE3, SSSE3, SSE4.1, SSE4.2, or AVX instructions as well as SSE/SSE2.

With /arch:AVX2 you get the same VEX-prefix behavior and instruction sets, plus the compiler may choose to optimize the code to use fused multiply-add (FMA3) instructions, which it assumes are available on any processor that supports AVX2. The auto-vectorizer can also use AVX2 instructions with this switch active.
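For example, a plain loop like the hypothetical one below uses no intrinsics at all, yet under /O2 the auto-vectorizer picks whatever instruction set the /arch switch allows, and under /arch:AVX2 the multiply-plus-add pattern is a candidate for contraction into an FMA3 instruction (typical behavior, not a guarantee).

```c
#include <stddef.h>

/* Auto-vectorization candidate: with /O2 and /arch:AVX2 the compiler may
   widen this to 256-bit ymm operations and fuse a[i]*b[i] + c[i] into a
   vfmadd instruction; with the default /arch it is limited to SSE/SSE2. */
void muladd(float *dst, const float *a, const float *b, const float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] * b[i] + c[i];
}
```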

TL;DR: If you use the compiler intrinsics, you are assuming responsibility for making sure they won't crash at runtime due to an invalid instruction exception. The /arch switch just lets you tell the compiler to use advanced instruction sets and encodings everywhere.

See this blog series for more details: DirectXMath: SSE, SSE2, and ARM-NEON; SSE3 and SSSE3; SSE4.1 and SSE 4.2; AVX; F16C and FMA; AVX2; and ARM64.

Chuck Walbourn
  • So there's no problem with using AVX intrinsics without enabling `/arch:AVX`? Did there ever used to be with older MSVC? I thought I remembered Cody Gray saying that it wasn't safe, but I might be imagining things. Does MSVC disable inlining of functions that use AVX into functions that don't, to prevent mixing VEX and non-VEX? – Peter Cordes Jun 13 '18 at 05:40
  • It works, but might not have good performance because the compiler has to be conservative about the state switches if using 256-bit registers. Of course, there are 128-bit AVX instructions too which don't have the state concerns. That's why some code will generate: ``warning C4752: found Intel(R) Advanced Vector Extensions; consider using /arch:AVX`` and others won't. – Chuck Walbourn Jun 13 '18 at 05:56
  • I will add that even the latest VS 2017 (**cl ver. 19.13.26129**) has problems (INTERNAL COMPILER ERROR) with SSE2/AVX and the optimiser. Here's the report at [Developer Community](https://developercommunity.visualstudio.com/content/problem/215602/new-optimizer-code-generation-bug.html). I've cooked up a test-case for it [here](https://gist.github.com/gvanem/724e3e15172d6eb4d94d3467fc618b4a). – G.Vanem Jul 04 '18 at 07:42
  • The latest production version is VS 2017 (15.7.4) which would be cl ver 19.14.26431 – Chuck Walbourn Jul 04 '18 at 10:01
  • Best as I can tell, `/arch:AVX2` also switches on F16C-support. This is mentioned (by the author of this post, no less :)) e.g. in https://walbourn.github.io/directxmath-avx2/ and https://github.com/Microsoft/DirectXMath/issues/31. Note also that the links at the bottom of the post are outdated - my first link is the last of the blog series, and contains the links to the others at the bottom. – Axel Feb 12 '21 at 16:38
  • ``/arch:AVX2`` implies both F16C and FMA3 support. The MSVC compiler will emit F16C instructions if you use those intrinsics even without a specific ``/arch`` setting, but clang/LLVM won't unless you use ``-mavx2`` or ``-mf16c``. I'll update the links (all MSDN personal blogs had to be migrated per [this post](https://walbourn.github.io/Welcome-to-GitHub-Pages/)). (A short sketch of those F16C intrinsics follows these comments.) – Chuck Walbourn Feb 12 '21 at 22:13
  • There is one real-world CPU with AVX2 but not FMA3, but it's made by Via (Eden possibly? Having a hard time finding it). Mysticial reported having met the lead architect (or an architect) at a conference, and hearing he was very unhappy when he realized they were going to be making a CPU with AVX2 but not FMA3. (There are also some CPUs with FMA3 but only AVX1, e.g. Piledriver / Steamroller. And those do have significant usage in Windows computers, unlike Via.) – Peter Cordes Apr 09 '22 at 21:05
  • AVX2 without FMA3. Ugh. That's not going to work for MSVC ``/arch:AVX2`` at all :( Thanks for the heads up. For DirectXMath, I'd fail to validate that CPU as having the required support when building with ``/arch:AVX2``, but it would work with ``/arch:AVX``. – Chuck Walbourn Apr 09 '22 at 22:54
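For what it's worth, the F16C intrinsics mentioned in the last few comments look like the hypothetical wrappers below; the point is simply that, per the comment above, MSVC will emit the corresponding instructions for them without a specific ``/arch`` setting, while clang/LLVM wants ``-mf16c`` or ``-mavx2``.

```c
#include <immintrin.h>

/* F16C: convert 8 packed half-precision values to float and back.
   MSVC emits vcvtph2ps / vcvtps2ph for these even without /arch:AVX2. */
__m256 halves_to_floats(__m128i h)   /* h holds 8 16-bit half-floats */
{
    return _mm256_cvtph_ps(h);
}

__m128i floats_to_halves(__m256 f)
{
    return _mm256_cvtps_ph(f, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
}
```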