I've written two versions of a dot-product kernel in .NET using 256-bit AVX intrinsics. One uses a fused multiply-add; the other separates it into a multiply followed by an add.
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Multiply + add version (assumes n is a multiple of 4).
public static unsafe Vector256<double> Dot(double* x, double* y, int n)
{
    var vresult = Vector256<double>.Zero;
    for (int i = 0; i < n; i += 4)
        vresult = Avx.Add(Avx.Multiply(Avx.LoadVector256(x + i), Avx.LoadVector256(y + i)), vresult);
    return vresult;
}
// Fused multiply-add version.
public static unsafe Vector256<double> Dot2(double* x, double* y, int n)
{
    var vresult = Vector256<double>.Zero;
    for (int i = 0; i < n; i += 4)
        vresult = Fma.MultiplyAdd(Avx.LoadVector256(x + i), Avx.LoadVector256(y + i), vresult);
    return vresult;
}
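(Both versions return a vector of four lane-wise partial sums and assume n is a multiple of 4; the caller reduces those lanes to a scalar. My reduction helper is roughly the following — the name HorizontalSum is mine:)

```csharp
using System.Runtime.Intrinsics;

public static class DotHelpers
{
    // Reduce the four lane-wise partial sums to a single scalar.
    public static double HorizontalSum(Vector256<double> v)
    {
        double sum = 0;
        for (int i = 0; i < 4; i++)
            sum += v.GetElement(i);
        return sum;
    }
}
```

So the complete dot product is HorizontalSum(Dot(x, y, n)), plus a scalar tail loop when n % 4 != 0.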
The JIT compiles these to the following x64 assembly:
C.Dot(Double*, Double*, Int32)
L0000: vzeroupper
L0003: vxorps ymm0, ymm0, ymm0
L0008: xor eax, eax
L000a: test r9d, r9d
L000d: jle short L002b
L000f: nop
L0010: movsxd r10, eax
L0013: vmovupd ymm1, [rdx+r10*8]
L0019: vmulpd ymm1, ymm1, [r8+r10*8]
L001f: vaddpd ymm0, ymm1, ymm0
L0023: add eax, 4
L0026: cmp eax, r9d
L0029: jl short L0010
L002b: vmovupd [rcx], ymm0
L002f: mov rax, rcx
L0032: vzeroupper
L0035: ret
C.Dot2(Double*, Double*, Int32)
L0000: vzeroupper
L0003: vxorps ymm0, ymm0, ymm0
L0008: xor eax, eax
L000a: test r9d, r9d
L000d: jle short L002b
L000f: nop
L0010: movsxd r10, eax
L0013: vmovupd ymm1, [rdx+r10*8]
L0019: vfmadd132pd ymm1, ymm0, [r8+r10*8]
L001f: vmovaps ymm0, ymm1
L0023: add eax, 4
L0026: cmp eax, r9d
L0029: jl short L0010
L002b: vmovupd [rcx], ymm0
L002f: mov rax, rcx
L0032: vzeroupper
L0035: ret
When I benchmark this code with BenchmarkDotNet on my Intel processor, I see a modest speedup from the FMA version, as expected. But on my AMD Ryzen 9 5900X, the FMA version is about 30% slower at nearly every array size. Is this a bug in AMD's implementation of vfmadd132pd, or in the .NET JIT?
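For reference, my benchmark harness is along these lines (a sketch: the array sizes and seed here are illustrative, and I'm assuming the two methods live in a class C, as the asm listing suggests):

```csharp
using System;
using System.Runtime.Intrinsics;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public unsafe class DotBench
{
    [Params(1024, 65536)]
    public int N;

    private double[] _x, _y;

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42);
        _x = new double[N];
        _y = new double[N];
        for (int i = 0; i < N; i++)
        {
            _x[i] = rng.NextDouble();
            _y[i] = rng.NextDouble();
        }
    }

    [Benchmark(Baseline = true)]
    public Vector256<double> MulAdd()
    {
        fixed (double* x = _x, y = _y)
            return C.Dot(x, y, N);
    }

    [Benchmark]
    public Vector256<double> Fused()
    {
        fixed (double* x = _x, y = _y)
            return C.Dot2(x, y, N);
    }
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<DotBench>();
}
```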