GCC 12 (MinGW 64): how to enable fused multiply-add code generation

Question

I apologize in advance in case the answer to my question is obvious, but trust me, I have been googling the whole day and searched here as well without finding anything relevant.

I am using GCC 12 (MinGW x64) on my x64 Windows i7 setup. I can't seem to get GCC to generate any float multiply-add instructions.

The simplest case:

float func(float a, float b, float c)
{
   return a*b+c;
}

produces this assembly:

mulss %xmm1, %xmm0
addss %xmm2, %xmm0
ret

No fused multiply/add instruction!

EDIT: this output is produced with the -O3 option

I tried all possible optimization and cpu target options, including -ffast-math and -march=corei7 to no avail.

EDIT: sorry, I made a mistake. I mistyped -mfma when I tried it, so I thought it was set when it was not. Sorry for having erroneously stated in the first version of my question that I had tried it.

Am I missing something elementary? How can I get GCC to generate those multiply-add instructions automatically?

I then thought I had to do it explicitly, so I tried the fmaf() function, but that simply results in a jmp to a library function, which is even worse!

UPDATE: it looks like, together with -O3 (which I always use by default anyway), I have to set either -mfma or -march=haswell for the FMA instructions to be generated. Benchmarks confirm that they bring a substantial speed improvement in time-critical code with chains of additions and multiplications. What I don't fully understand is why -march=corei7 or -march=corei7-avx alone is not enough. If FMA generation were disabled because of the stack alignment bug in MinGW (as somebody mentioned in the comments), then it should also be disabled when specifying -march=haswell...
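A minimal sketch of the situation described in the update, assuming GCC's default GNU C mode (where floating-point contraction is allowed) and the predefined macro `__FMA__`, which GCC sets when FMA is enabled for the target. The names `func_fma` and `fma_enabled` are hypothetical and only serve the illustration. The idea is to call `fmaf()` only when the hardware instruction is available, so the code never falls back to the library call complained about above.

```c
#include <math.h>

/* Plain expression: GCC may contract this into a single FMA instruction
   when the target supports it, e.g. with -O3 -mfma or -O3 -march=haswell. */
float func(float a, float b, float c)
{
    return a * b + c;
}

/* Explicit fused multiply-add, used only when the compiler was told that
   the target has FMA. Otherwise fall back to the plain expression rather
   than a call into the math library. Note that the fallback rounds twice,
   so results can differ in the last bit from a true fused operation. */
float func_fma(float a, float b, float c)
{
#ifdef __FMA__
    return fmaf(a, b, c);
#else
    return a * b + c;
#endif
}

/* Returns 1 if this translation unit was compiled with FMA enabled,
   0 otherwise. Handy for checking that a flag was really applied. */
int fma_enabled(void)
{
#ifdef __FMA__
    return 1;
#else
    return 0;
#endif
}
```

Compiling this file with `gcc -O3 -mfma -S` and looking for `vfmadd` mnemonics in the output is one way to confirm that the option took effect. A typo in the flag, as happened here, would show up as `fma_enabled()` returning 0.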

Thanks

[Works here with `-O3 -mfma`](https://godbolt.org/z/E7rb1Tfz1) — Nate Eldredge, Nov 13 '22 at 19:49
Note that MinGW for Windows may not be safe with AVX 256-bit vectors; it sometimes doesn't align the stack by 32 before spill/reload of vectors using 32-byte aligned loads/stores. Clang doesn't have this problem, nor does GCC for Linux. This bug has existed for years, and MinGW apparently still hasn't fixed it. IDK why not, but it's a problem if you're considering using more than just scalar FMAs. [How to align stack at 32 byte boundary in GCC?](https://stackoverflow.com/q/5983389) — Peter Cordes, Nov 13 '22 at 21:25
If `gcc -O3 -march=native` doesn't work, especially with `-ffast-math`, [edit] your question with details like the exact compiler version from `gcc -v`, and the exact command you ran. It's extremely suspicious that you got `mulss` not `vmulss` in a build with `-mfma`, because that implies `-mavx`, and when AVX is available GCC will always use the VEX encoding, with the `v...` mnemonic. That suggests to me that this asm output came from a build without `-mfma`. Or that MinGW completely disabled AVX support because of the bug I mentioned last comment! — Peter Cordes, Nov 13 '22 at 21:53
`gcc -v -fverbose-asm -O3 -march=native -S -o foo.s foo.c` should include asm comments with details on what options GCC thinks are enabled. Or maybe not, I think newer GCC versions stopped including a huge amount of comments in their .s output with -fverbose-asm. — Peter Cordes, Nov 13 '22 at 21:54
Thanks to all for the contributions. I have updated/fixed the question. As I specified (very sorry), I made a mistake when checking the .asm resulting from the -mfma option: I thought I had set it when I hadn't (gcc-induced headaches...). However, it still remains a mystery to me why specifying -march=corei7 is NOT enough and I instead have to specify either -mfma explicitly or even -march=haswell, which is an architecture from 2014... Or does this really have something to do with the stack alignment bug Peter Cordes mentioned? — elena, Nov 13 '22 at 23:55
@elena: Didn't see your comment earlier since you didn't @ notify me. `-march=corei7-avx` is `-march=sandybridge`, the first generation of i7 to have AVX. It didn't have AVX2 or FMA. `corei7` is first gen Nehalem from 2008. The `-march=core*` naming scheme is terrible, don't use it. Use microarchitecture names, or `-mtune=skylake -mfma -mavx2 -mbmi -mbmi2` or whatever to enable some specific extensions but not everything that a CPU supports, in case you ever want to do that. Or with recent GCC, `-march=x86-64-v3` (basically a Haswell baseline of FMA+AVX2 and BMI1/2, with tune=generic.) — Peter Cordes, Nov 14 '22 at 05:03
@PeterCordes thank you for your comment, very comprehensive. I didn't know about -march=x86-64-v3; it definitely seems to do a nice job. Re: MinGW stack alignment bug, I have by now tested the code containing FMA quite extensively and I can say I have experienced no crashes or other strange behavior. By the way, I am not using vectorization explicitly because I am not experienced with it yet, unless gcc automatically vectorizes something because of optimization — elena, Nov 18 '22 at 22:56
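To see which of the extensions named in the comments a given `-march` or `-m...` option actually turns on, one option is to inspect GCC's predefined feature macros. The sketch below assumes the macros `__AVX__`, `__AVX2__`, `__FMA__`, `__BMI__` and `__BMI2__`. The enum constants and the function name `enabled_isa_mask` are hypothetical and introduced only for this example.

```c
/* Bit flags for the x86 extensions discussed in the comments.
   These names are made up for this sketch. */
enum {
    ISA_AVX  = 1 << 0,
    ISA_AVX2 = 1 << 1,
    ISA_FMA  = 1 << 2,
    ISA_BMI  = 1 << 3,
    ISA_BMI2 = 1 << 4
};

/* Builds a bitmask of the extensions the compiler was allowed to use
   for this translation unit, based on its predefined macros. */
int enabled_isa_mask(void)
{
    int mask = 0;
#ifdef __AVX__
    mask |= ISA_AVX;
#endif
#ifdef __AVX2__
    mask |= ISA_AVX2;
#endif
#ifdef __FMA__
    mask |= ISA_FMA;
#endif
#ifdef __BMI__
    mask |= ISA_BMI;
#endif
#ifdef __BMI2__
    mask |= ISA_BMI2;
#endif
    return mask;
}
```

Going by the comment above, one would expect `-march=corei7` to set none of these bits, `-march=corei7-avx` to set only `ISA_AVX`, and `-march=haswell` or `-march=x86-64-v3` to set all five. That expectation follows from the thread and has not been verified by a run here.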
