48

I have learned that some Intel/AMD CPUs can do simultaneous multiply and add with SSE/AVX:
FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2.

I would like to know how best to do this in code, and I also want to know how it's done internally in the CPU, i.e. with the superscalar architecture. Let's say I want to do a long sum such as the following in SSE:

//sum = a1*b1 + a2*b2 + a3*b3 +... where a is a scalar and b is a SIMD vector (e.g. from matrix multiplication)
sum = _mm_set1_ps(0.0f);
a1  = _mm_set1_ps(a[0]); 
b1  = _mm_load_ps(&b[0]);
sum = _mm_add_ps(sum, _mm_mul_ps(a1, b1));

a2  = _mm_set1_ps(a[1]); 
b2  = _mm_load_ps(&b[4]);
sum = _mm_add_ps(sum, _mm_mul_ps(a2, b2));

a3  = _mm_set1_ps(a[2]); 
b3  = _mm_load_ps(&b[8]);
sum = _mm_add_ps(sum, _mm_mul_ps(a3, b3));
...

My question is: how does this get converted into simultaneous multiply and add? Can the data be dependent? I mean, can the CPU execute _mm_add_ps(sum, _mm_mul_ps(a1, b1)) simultaneously, or do the registers used in the multiply and the add have to be independent?

Lastly, how does this apply to FMA (on Haswell)? Is _mm_add_ps(sum, _mm_mul_ps(a1, b1)) automatically converted into a single FMA instruction or micro-operation?


2 Answers

50

The compiler is allowed to fuse a separated add and multiply, even though this changes the final result (by making it more accurate).

An FMA has only one rounding (it effectively keeps infinite precision for the internal temporary multiply result), while an ADD + MUL has two.
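
As a small numeric illustration of the single rounding, here is a minimal, self-contained sketch (it assumes round-to-nearest-even and a correctly rounded fmaf from <math.h>):

#include <math.h>
#include <stdio.h>

int main(void) {
    float a = 1.0f + 0x1p-12f;      /* exactly representable                       */
    float c = -(1.0f + 0x1p-11f);   /* exactly representable                       */

    volatile float mul = a * a;     /* exact product is 1 + 2^-11 + 2^-24;         */
                                    /* rounding to float drops the 2^-24 term      */
    printf("mul+add: %a\n", mul + c);        /* prints 0x0p+0                      */
    printf("fma:     %a\n", fmaf(a, a, c));  /* prints 0x1p-24: the term survives  */
    return 0;
}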

The IEEE and C standards allow this when #pragma STDC FP_CONTRACT ON is in effect, and compilers are allowed to have it ON by default (but not all do). GCC contracts into FMA by default (with the default -std=gnu*, but not with -std=c*, e.g. -std=c++14). For Clang, contraction is only enabled with -ffp-contract=fast; with just the #pragma, contraction applies only within a single expression like a+b*c, not across separate C++ statements.
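
To make the scope of the pragma concrete, here is a sketch with hypothetical function names (whether fusion actually happens depends on the compiler; e.g. Clang honours the pragma while GCC has historically ignored it):

#pragma STDC FP_CONTRACT ON

float one_expr(float a, float b, float c) {
    return a*b + c;    /* single expression: FP_CONTRACT ON permits fusing into an FMA */
}

float two_stmts(float a, float b, float c) {
    float t = a*b;     /* separately rounded temporary                                 */
    return t + c;      /* the pragma alone does not permit fusing across statements;   */
                       /* -ffp-contract=fast may do it anyway                          */
}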

This is different from strict vs. relaxed floating point (or, in gcc terms, -ffast-math vs. -fno-fast-math), which would allow other kinds of optimizations that could increase the rounding error depending on the input values. This one is special because of the infinite precision of the FMA internal temporary; if there were any rounding at all in the internal temporary, this wouldn't be allowed in strict FP.

Even if you enable relaxed floating-point, the compiler might still choose not to fuse since it might expect you to know what you're doing if you're already using intrinsics.


So the best way to make sure you actually get the FMA instructions you want is to use the provided intrinsics for them directly (a usage sketch follows the two lists below):

FMA3 Intrinsics: (Intel Haswell; a separate FMA feature bit, introduced alongside AVX2)

  • _mm_fmadd_pd(), _mm256_fmadd_pd()
  • _mm_fmadd_ps(), _mm256_fmadd_ps()
  • and about a gazillion other variations...

FMA4 Intrinsics: (AMD Bulldozer; a separate FMA4 feature bit, introduced alongside XOP)

  • _mm_macc_pd(), _mm256_macc_pd()
  • _mm_macc_ps(), _mm256_macc_ps()
  • and about a gazillion other variations...
Peter Cordes
Mysticial
  • Thanks, that more or less answers my question about FMA. I should really spend some time learning some x86 assembly. That would probably answer most of my questions. –  Apr 10 '13 at 18:39
  • As for your question about whether a multiply and an add can be done simultaneously (like an FMA): the answer is no, since the add uses the result of the multiply, so you eat the latency of add + multiply. An FMA instruction does both operations together, usually with the same latency as a single multiply, so the add is free. – Mysticial Apr 10 '13 at 18:45
  • Thanks, that's what I thought. Now I just need to figure out how to organize my code so that the sum like I defined above does independent adds and multiplies simultaneously (so I avoid latencies). –  Apr 10 '13 at 19:10
  • You actually don't need to spend too much time on it. The hardware is capable of reordering things. As it is right now, all the multiplies are independent, but the adds are all dependent on the `sum` variable. So your best bet is to split the `sum` variable into multiple variables and run multiple sum chains in parallel across all of them. Then at the end, combine them back into one `sum` (see the sketch after this comment thread). – Mysticial Apr 10 '13 at 19:12
  • Okay, I do something like that already (unrolling my loop). I assume I don't need separate sum variables for every sum? Right now I do four (sum1, sum2, sum3, sum4) and then add those together and increment my loop counter by four. –  Apr 10 '13 at 19:17
  • You only need to separate them as much as it takes to reach the max throughput. The critical path is on the additions. The latency of an `addps` is 3 cycles, but the throughput is 1 per cycle, so you need a minimum of 3 separate sum chains to fully utilize it. You currently have 4, so that's sufficient. – Mysticial Apr 10 '13 at 19:41
  • Thanks, that answers all my questions. –  Apr 10 '13 at 19:46
  • gcc in practice *does* fuse mul/add intrinsics at `-O3`, without `-ffast-math`, even in situations where FLT_EVAL_METHOD=0, not 2. This is supported by some documentation (that keeping infinite precision for temporaries is always legal, to allow collapsing expressions). There's [a new question](http://stackoverflow.com/questions/34436233/fused-multiply-add-and-default-rounding-modes) about this. – Peter Cordes Dec 23 '15 at 13:06
  • @PeterCordes That's good to know. Either I wasn't aware of that at the time, or things have changed. – Mysticial Dec 23 '15 at 17:28
  • Are you sure that "the compiler will violate strict IEEE floating-point behavior by fusing"? I am not so sure any more. I think maybe IEEE supports multiple "modes", one of which allows single-rounding contractions. GCC by default uses x87 for 32-bit code, SSE for 64-bit code, and FMA on FMA hardware, and I don't think any of these cases necessarily violates strict IEEE behaviour. Clang on the other hand uses SSE by default in all three cases, so it's more consistent. – Z boson Jan 04 '16 at 08:40
  • @Zboson I believe it's not allowed if you store the intermediate into a variable first. C/C++ allows intermediates to be done at higher precision. That implies it can go ahead and fuse `_mm_add_ps(c, _mm_mul_ps(a, b))`, but not if you do `temp = _mm_mul_ps(a, b); temp = _mm_add_ps(temp, c);`. But we're getting into very pedantic parts of the standard, so I can't say for sure. It also doesn't help when compilers like ICC (by default) will break from the standard specification. – Mysticial Jan 04 '16 at 09:04
  • Well, that's why I am asking, because GCC does FMA by default (it uses `-ffp-contract=fast` by default) if you use `-mfma`, so I want to know if GCC is breaking the standard. From your answer I would assume it is. I know ICC uses a fast floating-point model by default. – Z boson Jan 04 '16 at 09:08
  • I am surprised nobody answered [my question about this](http://stackoverflow.com/questions/34436233/fused-multiply-add-and-default-rounding-modes) (maybe because I asked more than one question). Maybe I need to ask it again but be more clear. – Z boson Jan 04 '16 at 09:17
  • I think your answer is misleading since a compiler can use FMA by default without breaking IEEE rules http://stackoverflow.com/a/34817983/2542702 – Z boson Jan 19 '16 at 12:55
  • @Zboson Feel free to edit it accordingly. Since this apparently seems to have changed over the years. At this point, you know more than me on the topic. – Mysticial Jan 19 '16 at 13:14
  • Just to confirm, not available in Sandy Bridge? – Alec Teal Jun 29 '19 at 15:10
  • @AlecTeal Correct. Sandy Bridge does not have FMA. – Mysticial Jun 30 '19 at 04:12
  • @Mysticial just to confirm (I should have been clearer) Sandy Bridge lacks /any/ FMA instruction? IIRC that's "no FMA in:" SSSE3, SSE4.1, SSE4.2 and AVX and obviously the other SSEs. I can't believe it was nearly 8 years ago! I'm pretty sure Intel/AMD kept switching on FMA3/4 which came from SSE4 (which got dropped, but addendums 1 and 2 didn't) - Intel went with neither in SB? For my work I still support SB and I've been designing something with FMA (the concept, not an extension) in mind, so this is a shock to me! – Alec Teal Jun 30 '19 at 16:35
  • @AlecTeal That's correct. Sandy Bridge has no FMA at all. There are some nontrivial technical reasons why Intel flip-flopped with the FMA3/4 stuff and why it took them so long to do any FMA for their x86 line. – Mysticial Jul 01 '19 at 04:01
  • @Mysticial I had no idea their hatred of not having /total/ control over AMD as a thing they can trot out when someone with "anti-trust" on a folder walks by was imbued so deep in the resulting hardware! I love that they call it IA-64 now. The technical/political thing has been discussed to death elsewhere, on a serious note (and speaking of IA-64) - Intel has done FMA before, as did many of the other RISCs that got killed off by it (RIP ALPHA WE LOVE YOU) which became HP's bastard step-child from Compaq, HP was strongly backing Itanium, and this was well post standardisation of FP. – Alec Teal Jul 01 '19 at 21:02
  • In any event @Mysticial thanks for the news, no wonder I couldn't find it in the manual, I just told myself I'd missed it because Intel had their own name for it and the document was huge; the document for the thing that powers just shy of 90% of the stuff the thing works on. But that's a me problem :P Thanks again – Alec Teal Jul 01 '19 at 21:05
  • Maybe a good idea to mention scalar `fma()` from `math.h`, as a building-block that can be auto-vectorized. It seems to work around some missed-optimization bugs like in [Compiler is not producing FMA instructions for simple loop compiled for AVX512 CPU](https://stackoverflow.com/q/71902958) where GCC can vectorize for `-march=cascade-lake` but doesn't use FMA. – Peter Cordes Apr 17 '22 at 16:42
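
A sketch of the multiple-accumulator idea from the comments above (hypothetical helper; it assumes n is a multiple of 4, unaligned loads are fine, and FMA3 is available):

#include <immintrin.h>

__m128 dot_chain_unrolled(const float *a, const float *b, int n) {
    /* Four independent sum chains so the FMAs don't all serialize on one register. */
    __m128 s0 = _mm_setzero_ps(), s1 = _mm_setzero_ps();
    __m128 s2 = _mm_setzero_ps(), s3 = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4) {
        s0 = _mm_fmadd_ps(_mm_set1_ps(a[i+0]), _mm_loadu_ps(&b[4*(i+0)]), s0);
        s1 = _mm_fmadd_ps(_mm_set1_ps(a[i+1]), _mm_loadu_ps(&b[4*(i+1)]), s1);
        s2 = _mm_fmadd_ps(_mm_set1_ps(a[i+2]), _mm_loadu_ps(&b[4*(i+2)]), s2);
        s3 = _mm_fmadd_ps(_mm_set1_ps(a[i+3]), _mm_loadu_ps(&b[4*(i+3)]), s3);
    }
    /* Combine the independent chains only at the end. */
    return _mm_add_ps(_mm_add_ps(s0, s1), _mm_add_ps(s2, s3));
}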
21

I tested the following code in GCC 5.3, Clang 3.7, ICC 13.0.1 and MSVC 2015 (compiler version 19.00).

#include <immintrin.h>

float mul_add(float a, float b, float c) {
    return a*b + c;
}

__m256 mul_addv(__m256 a, __m256 b, __m256 c) {
    return _mm256_add_ps(_mm256_mul_ps(a, b), c);
}

With the right compiler options (see below) every compiler will generate a vfmadd instruction (e.g. vfmadd213ss) from mul_add. However, only MSVC fails to contract mul_addv to a single vfmadd instruction (e.g. vfmadd213ps).

The following compiler options are sufficient to generate vfmadd instructions (except for mul_addv with MSVC).

GCC:   -O2 -mavx2 -mfma
Clang: -O1 -mavx2 -mfma -ffp-contract=fast
ICC:   -O1 -march=core-avx2
MSVC:  /O1 /arch:AVX2 /fp:fast

GCC 4.9 does not contract mul_addv to a single FMA instruction, but GCC 5.1 and later do. I don't know when the other compilers started doing this.

Z boson
  • See also `#pragma STDC FP_CONTRACT ON`. Stephen Canon points out that it allows contraction only within a single statement, not across statements. (http://lists.llvm.org/pipermail/cfe-dev/2015-September/045110.html). Also note that gcc enables contraction only with `-std=gnu*`, not with `-std=c11` or whatever. (And then it enables contraction across statements, beyond what IEEE + ISO C strictly allow). Another test function that uses separate variables might be worth trying. – Peter Cordes Sep 08 '17 at 19:37
  • @PeterCordes, see this https://stackoverflow.com/q/34436233/2542702 and Stephen Canon's answer. I think what GCC is doing is okay according to Stephen's answer (assuming that GCC did not ignore `STDC FP_CONTRACT`, which it unfortunately does last time I checked). – Z boson Sep 11 '17 at 11:17
  • Your question there only asks about `return a*b + c;`, not about `float mul = a*b; return mul + c;`. Read Stephen's mailing-list post carefully: he mentions that clang's `STDC FP_CONTRACT ON` only enables contraction within an expression, unlike clang's `-ffp-contract=fast` which would enable it for my second example in this comment, too. That's why clang has separate `on` vs. `fast` settings for the command-line option. See my recent edits to Mysticial's answer on this question. It's messier than I thought at first :( – Peter Cordes Sep 11 '17 at 17:19
  • @PeterCordes, one of my points is that GCC ignores `#pragma STDC FP_CONTRACT`. At least last time I checked. I should check this again (for e.g. gnuc99 and c99 or whatever). – Z boson Sep 13 '17 at 10:51
  • I think that's still true. And its actual behaviour goes beyond what `#pragma STDC FP_CONTRACT ON` allows, so it's not quite like defaulting that to ON and failing to provide a way to turn it off. I think from what I've read that IEEE + C doesn't specify a `#pragma STDC FP_CONTRACT FAST`, even though that is a *useful* setting. – Peter Cordes Sep 13 '17 at 15:28