Your code uses AVX1 + FMA instructions, not AVX2. It would run fine on an AMD Piledriver, for example. (Assuming the hsum is implemented in a sane way, extracting the high half and then using 128-bit shuffles.)
If your AVX-only CPU doesn't have FMA either, you'd need to use `_mm256_mul_ps` and `_mm256_add_ps`.
For Intel, AVX2 and FMA were introduced in the same generation, Haswell, but those are different extensions. FMA is available in some CPUs without AVX2.
There is unfortunately even a VIA CPU with AVX2 but not FMA; otherwise AVX2 implies FMA, unless you're in a VM or emulator that intentionally exposes a combination of extensions that real HW doesn't have.
(There was an FMA4 extension in some AMD CPUs, with 4 operands (3 inputs and a separate output), from Bulldozer through Zen1, after Intel pulled a switcheroo on AMD too late for them to change their Bulldozer design to support FMA3. That's why there's an AMD-only FMA4, and why it wasn't until Piledriver that AMD supported an FMA extension compatible with Intel's. But that's part of the dust pile of history now, so usually we just say FMA to refer to the extension that's technically called FMA3. See Agner Fog's 2009 blog post *Stop the instruction set war*, and *How do I know if I can compile with FMA instruction sets?*)
- AVX1: 256-bit FP only (no integer instructions except `vptest`, although FP in this case does include bitwise instructions like `vxorps ymm`). Shuffles are only in-lane (e.g. `vshufps ymm` or the new `vpermilps`) or with 128-bit granularity (`vperm2f128`, or `vinsertf128` / `vextractf128`). AVX1 also provides VEX encodings of all SSE1..4 instructions including integer, with 3-operand non-destructive syntax, e.g. `vpsubb xmm0, xmm1, [rdi]`.
- AVX2: 256-bit versions of the integer SSE instructions, and new lane-crossing shuffles like `vpermps` / `vpermd` and `vpermq` / `vpermpd`, and `vbroadcastss/sd ymm, xmm` with a register source (AVX1 only had `vbroadcastss ymm, [mem]`). Also an efficient `vpblendd` immediate integer blend instruction, like `vblendps`.
- FMA3: `vfmadd213ps x/ymm, x/ymm, x/ymm/mem` and so on. (And `pd` and scalar `ss`/`sd` versions.) Also `vfmsub...` (subtract the 3rd operand), `vfnmadd...` (negate the product), and even `vfmaddsub...ps`. `_mm256_fmadd_ps` will compile to some form of `vfmadd...ps`, depending on which input operand the compiler wants to overwrite, and which operand it wants to use as the memory operand.
This order of introduction explains the bad choice of intrinsic naming: e.g. `_mm256_permute_ps` (immediate) and `_mm256_permutevar_ps` (vector control) are the AVX1 `vpermilps` in-lane permute, with AVX2's lane-crossing version getting saddled with `_mm256_permutevar8x32_ps` (AVX-512VL later added `_mm256_permutexvar_ps` as another name for `vpermps ymm`). So confusingly the short `permute` intrinsic names went to the in-lane shuffles, while in asm the plain mnemonic `vpermps` is the lane-crossing one.