
MSVC has supported AVX/AVX2 instructions for years now, and according to this MSDN blog post, it can automatically generate fused multiply-add (FMA) instructions.

Yet neither of the following functions compiles to an FMA instruction:

#include <cmath>

float func1(float x, float y, float z)
{
    return x * y + z;
}

float func2(float x, float y, float z)
{
    return std::fma(x, y, z);
}

Even worse, std::fma is not implemented as a single FMA instruction and performs terribly, much slower than a plain x * y + z. (Poor performance of std::fma is expected if the implementation doesn't rely on the FMA instruction.)
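To illustrate why the two forms are not interchangeable without something like /fp:fast, here is a minimal sketch of the rounding difference. The helper names mul_then_add and fused_mul_add are hypothetical, and the volatile store is only there to keep a compiler from contracting the first function into an FMA on its own.

```cpp
#include <cmath>

// Hypothetical helper: multiply, round the product to float, then add.
// This performs two roundings. The volatile store forces the intermediate
// rounding so the compiler cannot fuse the two operations.
float mul_then_add(float x, float y, float z)
{
    volatile float product = x * y;
    return product + z;
}

// Hypothetical helper: std::fma rounds the exact value of x * y + z once,
// whether it maps to a hardware FMA or to a (slow) software fallback.
float fused_mul_add(float x, float y, float z)
{
    return std::fma(x, y, z);
}
```

With x = y = 1 + 2^-13 and z = -(1 + 2^-12), the exact product is 1 + 2^-12 + 2^-26, which rounds to 1 + 2^-12 as a float. So mul_then_add returns 0, while fused_mul_add returns 2^-26. A software std::fma has to reproduce that single rounding without the instruction, which is consistent with it being much slower.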

I compile with the /arch:AVX2 /O2 /Qvec flags. I also tried /fp:fast, without success.

So the question is: how can MSVC be forced to automatically emit FMA instructions?

UPDATE

There is a #pragma fp_contract (on|off), but it appears to do nothing.

plasmacel
  • 2
    You probably need to use [compiler intrinsics functions](https://msdn.microsoft.com/en-us/library/hh977022.aspx). – Some programmer dude Dec 14 '15 at 11:38
  • 1
    I know these intrinsics, but I'm not interested in them. I want the compiler to automatically generate the instructions, just like GCC and Clang. It's 2016. Furthermore, there are many cases where you can't explicitly use these intrinsics, because the fused multiply-add doesn't belong to a single operation or function; it arises from multiple inlined, optimized expressions. – plasmacel Dec 14 '15 at 11:49
  • 2
    Good luck. From my experience, MS doesn't care about that part of the compiler. Even when you use intrinsics, it does some pretty terrible code generation for FMA instructions. If you care about performance for FMAs on Windows, use a different compiler. (ICC is pretty good) – Mysticial Dec 14 '15 at 15:32
  • To be honest, MSVC lacks so many modern features that should be basic elements of a compiler today. Not to mention that it is behind the standard all the time. I was shocked that it won't optimize away small loops where the iteration count (say 4) is known at compile time, and there isn't even a pragma to explicitly request it. It still implements OpenMP 2.5, so you can't use size_t for omp loops, even though OpenMP 4.5 is out now. It offers multiple enhanced instruction sets, yet it doesn't generate proper code for them. I actually use Clang for Windows, but wanted to optimize things for MSVC too. – plasmacel Dec 14 '15 at 16:18
  • 3
    Are you looking for scalar FMA or packed (vector) FMA? From your code snippet (assuming given functions are not inlined) - MSVS will not be able to generate vector code. I would not be surprised if MSVS only uses FMA, when there is vector code on the table. Did you try to write simple data processing loop, iteratively doing FMA (making sure all arrays are defined in the same function) and compile it with MSVS? – zam Dec 15 '15 at 10:14
  • 2
    [It worked for me with `/O1 /arch:AVX2 /fp:fast` with MSVC 2015](http://stackoverflow.com/questions/15933100/how-to-use-fused-multiply-add-fma-instructions-with-sse-avx/34461738#34461738). – Z boson Jan 04 '16 at 19:29

2 Answers


I solved this long-standing problem.

As it turns out, the flags /fp:fast, /arch:AVX2 and /O1 (or higher) are not enough for Visual Studio 2015 to emit FMA instructions in 32-bit mode. You also need to turn on "Whole Program Optimization" with the /GL flag.

Then Visual Studio 2015 will generate an FMA instruction vfmadd213ss for

float func1(float x, float y, float z)
{
    return x * y + z;
}

Regarding std::fma, I opened a bug at Microsoft Connect. They confirmed the behavior that std::fma doesn't compile to FMA instructions, because the compiler doesn't treat it as an intrinsic. According to their response it will be fixed in a future update to get the best codegen possible.

plasmacel
  • I did not need `/GL`. I think you are compiling in 32-bit mode. That's silly. – Z boson Apr 09 '16 at 16:06
  • 1
    The question didn't mention x64 and in some circumstances it is not possible to compile in 64-bit mode because of the dependencies. – plasmacel Apr 09 '16 at 16:49
  • 1
    Was this fixed on VS 2017 and VS 2019? – Royi May 03 '19 at 11:24
  • @Royi I didn't try since that version. – plasmacel May 03 '19 at 11:29
  • I guess this is before you become `clang` addicted :-). – Royi May 03 '19 at 11:30
  • 1
    @Royi Exactly. Now you can use the official LLVM Visual Studio extension to compile with clang-cl using Visual Studio as an IDE. Or you can use CMake based projects (compiling with any compatible compiler) in Visual Studio 2017 and 2019 which is completely cross-platform. – plasmacel May 03 '19 at 11:31
  • The problem I have is having CLang with OpenMP support as we previously talked - https://chat.stackoverflow.com/rooms/170911. Do you have like a guide to how you build it? – Royi May 03 '19 at 11:34
  • @Royi I think OpenMP binaries are now included in the official LLVM Windows binaries. At least they were in LLVM 7.0. – plasmacel May 03 '19 at 11:35
  • Could we talk about it in the chat? It is a game changer. – Royi May 03 '19 at 11:37
  • 1
    `/GL` whole-program optimization will let it use a custom calling convention (perhaps vectorcall or fully custom) instead of having to return in x87 `st(0)` in 32-bit mode, making it worthwhile to use FMA instructions for tiny functions. This is not a factor when compiling for 64-bit mode, since the standard calling convention already returns scalar float/double in XMM0. – Peter Cordes Aug 16 '22 at 10:38

MSVC 2015 does generate an fma instruction for scalar operations but not for vector operations (unless you explicitly use an fma intrinsic).

I compiled the following code

//foo.cpp
#include <immintrin.h> // __m256, _mm256_mul_ps, _mm256_add_ps

float mul_add(float a, float b, float c) {
    return a*b + c;
}

//MSVC cannot handle vectors as function parameters so use const references
__m256 mul_addv(__m256 const &a, __m256 const &b, __m256 const &c) {
    return _mm256_add_ps(_mm256_mul_ps(a, b), c);
}

with

cl /c /O2 /arch:AVX2 /fp:fast /FA foo.cpp

in MSVC2015 and it produced the following assembly

;mul_add
vmovaps xmm3, xmm1
vfmadd213ss xmm3, xmm0, xmm2
vmovaps xmm0, xmm3

and

;mul_addv
vmovups ymm0, YMMWORD PTR [rcx]
vmulps  ymm1, ymm0, YMMWORD PTR [rdx]
vaddps  ymm0, ymm1, YMMWORD PTR [r8]
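
For the vector case, the explicit-intrinsic route mentioned at the top of this answer could look roughly like the sketch below. It uses _mm256_fmadd_ps from immintrin.h. The names mul_add8, mul_add8_fma, sketch_cpu_has_fma and the SKETCH_* macros are hypothetical. The sketch also assumes that an MSVC build with /arch:AVX2 only runs on FMA-capable CPUs, so the runtime check is done for GCC/Clang only, and a scalar std::fma loop is the fallback elsewhere.

```cpp
#include <cmath>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define SKETCH_X86 1
#else
#define SKETCH_X86 0
#endif

#if SKETCH_X86
#if defined(__GNUC__)
// GCC/Clang: enable AVX + FMA for this one function, without -mfma globally.
#define SKETCH_TARGET_FMA __attribute__((target("avx,fma")))
#else
// MSVC: intrinsics are available without any per-function annotation.
#define SKETCH_TARGET_FMA
#endif

// Hypothetical helper: is it safe to execute AVX + FMA3 instructions here?
static bool sketch_cpu_has_fma()
{
#if defined(__GNUC__)
    return __builtin_cpu_supports("avx") && __builtin_cpu_supports("fma");
#else
    return true; // Assumption: built with /arch:AVX2 for FMA-capable CPUs.
#endif
}

// Eight fused multiply-adds in one instruction (vfmadd...ps on ymm registers).
SKETCH_TARGET_FMA
static void mul_add8_fma(const float* a, const float* b, const float* c, float* out)
{
    const __m256 va = _mm256_loadu_ps(a);
    const __m256 vb = _mm256_loadu_ps(b);
    const __m256 vc = _mm256_loadu_ps(c);
    _mm256_storeu_ps(out, _mm256_fmadd_ps(va, vb, vc));
}
#endif

// Computes out[i] = a[i] * b[i] + c[i] for i in [0, 8) with a single rounding
// per element. Pointers are used instead of __m256 parameters.
void mul_add8(const float* a, const float* b, const float* c, float* out)
{
#if SKETCH_X86
    if (sketch_cpu_has_fma()) {
        mul_add8_fma(a, b, c, out);
        return;
    }
#endif
    // Portable fallback: same results, element by element.
    for (int i = 0; i < 8; ++i) {
        out[i] = std::fma(a[i], b[i], c[i]);
    }
}
```

Both paths round each element once, so they should produce identical results. The intrinsic path simply guarantees the fused instruction instead of relying on the compiler to contract a separate multiply and add.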
Z boson