Could someone explain to me why there are 3 variants of the fused multiply-accumulate instruction (vfmadd132pd, vfmadd231pd and vfmadd213pd), while there is only one C intrinsic, _mm256_fmadd_pd?
To make things simple, what is the difference between (in AT&T syntax)
vfmadd132pd %ymm0, %ymm1, %ymm2
vfmadd231pd %ymm0, %ymm1, %ymm2
vfmadd213pd %ymm0, %ymm1, %ymm2
Intel's intrinsics guide did not make this clear to me. I ask because I see all three of them in the assembler output of a chunk of C code I wrote. Thanks.
A clean answer (reformatting the answers below)
For variant ijk, the meaning of vfmaddijkpd is:
- Intel syntax:
op(i) * op(j) + op(k) -> op(1)
- AT&T syntax:
op(4-i) * op(4-j) + op(4-k) -> op(3)
where op(n) denotes the n-th operand written after the instruction mnemonic. AT&T syntax lists the operands in the reverse order of Intel syntax, so the two forms are related by the index transform:
n <- 4 - n