Why does the FMA _mm256_fmadd_pd() intrinsic have 3 asm mnemonics, "vfmadd132pd", "231" and "213"?

Question

Could someone explain to me why there are 3 variants of the fused multiply-accumulate instruction: vfmadd132pd, vfmadd231pd and vfmadd213pd, while there is only one C intrinsics _mm256_fmadd_pd?

To make things simple, what is the difference between (in AT&T syntax)

vfmadd132pd  %ymm0, %ymm1, %ymm2
vfmadd231pd  %ymm0, %ymm1, %ymm2
vfmadd213pd  %ymm0, %ymm1, %ymm2

I did not get any idea from Intel's intrinsics guide. I ask because I see all of them in the assembler output of a chunk of C code I wrote. Thanks.

A clean answer (re-formating answers below)

For variant ijk, the meaning of vfmaddijkpd:

intel syntax: op(i) * op(j) + op(k) -> op(1)
AT&T syntax: op(4-i) * op(4-j) + op(4-k) -> op(3)

where op(n) denotes the n-th operand after the instruction. So there is a reverse transform between the two:

n <- 4 - n

Note that AVX and FMA are two separate things - there are CPUs which have AVX but not FMA - you should probably remove AVX from the title and tags to avoid confusion. — Paul R, Apr 04 '16 at 08:33
@PaulR, I think AVX2 would be an appropriate tag though. The OP is asking about FMA3 and all processors with FMA3 have AVX2. If you compile with `-mfma` without `-mavx2` with GCC it will recommend the `-mavx2` options as well. — Z boson, Apr 05 '16 at 12:38
@Zboson: apparently AMD [Piledriver](https://en.wikipedia.org/wiki/Piledriver_(microarchitecture)) has FMA3 but only AVX (not AVX2). — Paul R, Apr 05 '16 at 13:21
@PaulR, woah, I did not know that. It is a bit strange GCC gives this warning then since it's not true in every case. In fact Steamroller does not have AVX2 either but has FMA3. — Z boson, Apr 05 '16 at 13:25
@Zboson The idea is that if you're gonna compile for FMA3 without AVX2, then you should be compiling for FMA4 instead. — Mysticial, Jan 12 '17 at 19:24
@PaulR: FMA3 *depends on* AVX because it uses VEX encoding. I guess the AVX tag still isn't relevant to the instruction *naming*, though. — Peter Cordes, Oct 07 '19 at 07:52

score 14 · Accepted Answer · edited Oct 07 '19 at 05:04

The fused multiply-add instructions multiply two (packed) values, add a third value, and then overwrite one of the values with the result. Only one of the three values can be a memory operand rather than a register.

The way it works is that all three instructions overwrite ymm0 and allow only ymm2 to be a memory operand. The choice of instruction determines which two operands are multiplied and which is added.

Assuming that ymm0 is the first operand in Intel syntax (or the last in AT&T syntax):

vfmadd132pd:  ymm0 = ymm0 * ymm2/mem + ymm1
vfmadd231pd:  ymm0 = ymm1 * ymm2/mem + ymm0
vfmadd213pd:  ymm0 = ymm1 * ymm0 + ymm2/mem

When using the C intrinsics, this choice isn't necessary: The intrinsic does not overwrite a value but returns its result instead, and it allows all three values to be read from memory. The compiler will add memory reads/writes if needed, and will allocate a temporary register to store the result if it does not want any of the three values to be overwritten. It will choose one of the three instructions as it sees fit.

@AlphaBetaGamma Yes, I used Intel syntax which is also what almost all instruction set references you will find use. One of the many things wrong with AT&T syntax is that it switches the operand order to place the target at the end. In the case of these instructions, it also means that the `1`/`2`/`3` in the instruction name is no longer correct (in Intel syntax, 123 means that the instruction performs 1*2+3 where 1,2,3 is the order of operands). — interjay, Apr 03 '16 at 23:00

score 4 · Answer 2 · edited Oct 07 '19 at 07:54

This is in the assembly instruction set reference, and also in HTML extracts of it, like the entry for VFMADD*PD:

VFMADD132PD: Multiplies the two or four packed double-precision floating-point values from the first source operand to the two or four packed double-precision floating-point values in the third source operand, adds the infinite precision intermediate result to the two or four packed double-precision floating-point values in the second source operand, performs rounding and stores the resulting two or four packed double-precision floating-point values to the destination operand (first source operand).

VFMADD213PD: Multiplies the two or four packed double-precision floating-point values from the second source operand to the two or four packed double-precision floating-point values in the first source operand, adds the infinite precision intermediate result to the two or four packed double-precision floating-point values in the third source operand, performs rounding and stores the resulting two or four packed double-precision floating-point values to the destination operand (first source operand).

VFMADD231PD: Multiplies the two or four packed double-precision floating-point values from the second source to the two or four packed double-precision floating-point values in the third source operand, adds the infinite precision intermediate result to the two or four packed double-precision floating-point values in the first source operand, performs rounding and stores the resulting two or four packed double-precision floating-point values to the desti- nation operand (first source operand).

Why is the second instruction applying the pattern 213 instead of just 123? Can you show a simple example which explains the difference? — Carl in 't Veld, Jul 10 '19 at 21:32

Why does the FMA _mm256_fmadd_pd() intrinsic have 3 asm mnemonics, "vfmadd132pd", "231" and "213"?

2 Answers2

Linked