
I've created two versions of a dot product in .NET using AVX-256 instructions. One uses a fused multiply-add, and the other separates it into a multiply followed by an add.

// Multiply-then-add version. Assumes n is a multiple of 4; the caller
// horizontally sums the returned vector to get the scalar dot product.
public static unsafe Vector256<double> Dot(double* x, double* y, int n)
{
    var vresult = Vector256<double>.Zero;
    int i = 0;

    for (; i < n; i += 4)
        vresult = Avx.Add(Avx.Multiply(Avx.LoadVector256(x + i), Avx.LoadVector256(y + i)), vresult);

    return vresult;
}

// FMA version: one fused multiply-add per iteration instead of a
// separate multiply and add. Same assumptions as Dot.
public static unsafe Vector256<double> Dot2(double* x, double* y, int n)
{
    var vresult = Vector256<double>.Zero;
    int i = 0;

    for (; i < n; i += 4)
        vresult = Fma.MultiplyAdd(Avx.LoadVector256(x + i), Avx.LoadVector256(y + i), vresult);

    return vresult;
}

This compiles to the following JIT asm:

C.Dot(Double*, Double*, Int32)
    L0000: vzeroupper
    L0003: vxorps ymm0, ymm0, ymm0
    L0008: xor eax, eax
    L000a: test r9d, r9d
    L000d: jle short L002b
    L000f: nop
    L0010: movsxd r10, eax
    L0013: vmovupd ymm1, [rdx+r10*8]
    L0019: vmulpd ymm1, ymm1, [r8+r10*8]
    L001f: vaddpd ymm0, ymm1, ymm0
    L0023: add eax, 4
    L0026: cmp eax, r9d
    L0029: jl short L0010
    L002b: vmovupd [rcx], ymm0
    L002f: mov rax, rcx
    L0032: vzeroupper
    L0035: ret

C.Dot2(Double*, Double*, Int32)
    L0000: vzeroupper
    L0003: vxorps ymm0, ymm0, ymm0
    L0008: xor eax, eax
    L000a: test r9d, r9d
    L000d: jle short L002b
    L000f: nop
    L0010: movsxd r10, eax
    L0013: vmovupd ymm1, [rdx+r10*8]
    L0019: vfmadd132pd ymm1, ymm0, [r8+r10*8]
    L001f: vmovaps ymm0, ymm1
    L0023: add eax, 4
    L0026: cmp eax, r9d
    L0029: jl short L0010
    L002b: vmovupd [rcx], ymm0
    L002f: mov rax, rcx
    L0032: vzeroupper
    L0035: ret

When I benchmark this code with BenchmarkDotNet on my Intel processor, I see a modest speedup from FMA, as expected. But when I run it on my AMD Ryzen 5900X, the FMA version is about 30% slower on nearly every array size. Is this a bug in AMD's implementation of vfmadd132pd or in Microsoft's compiler?

kolbe
  • The first loop (without FMA) has better potential for parallel execution of multiplications across multiple loop iterations. `vmulpd` has no dependency on the previous results of `vmulpd` or `vaddpd` while `vfmadd132pd` does depend on the past result of `vfmadd132pd`. This might explain the difference: Intel CPU is able to exploit this parallelism while AMD CPU is not for some reason. – Andrey Semashev Jun 04 '22 at 20:30
  • I should note that the loop is not taking full advantage of the instruction-level parallelism, and therefore is not very optimal (if not memory-bound, it is latency-bound). If you can spare the rounding, you should use multiple accumulators in the loop and combine them at the end. – Andrey Semashev Jun 04 '22 at 20:33
  • Also, the leading `vzeroupper` is redundant. But I guess it's a JIT compiler bug. – Andrey Semashev Jun 04 '22 at 20:35
  • > Intel CPU is able to exploit this parallelism while AMD CPU is not -- I meant the other way around, of course. The CPU that runs the first loop faster is able to extract better performance from ILP. – Andrey Semashev Jun 04 '22 at 21:23
  • Just an option: cast the pointer, `Vector256<double>* vx = (Vector256<double>*)x;`, then loop through it by vector count instead of double count. You may then access a vector like `vx[i]`, where `i` is the vector index. With this approach the JIT may apply some more optimizations to the output code, such as omitting redundant range checks or using increments instead of adds. (A sketch of this idea follows these comments.) – aepot Jun 26 '22 at 10:20
  • @aepot Good idea. I tried it in my [LitMath library](https://github.com/matthewkolbe/LitMath), but benchmarking was about the same, and the ASM didn't look any better. – kolbe Jul 09 '22 at 20:31
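
For reference, here is a minimal sketch of aepot's suggestion (the `DotVec` name is hypothetical; like the question's code, it assumes `n` is a multiple of 4 and leaves the horizontal sum to the caller):

public static unsafe Vector256<double> DotVec(double* x, double* y, int n)
{
    // Reinterpret the double pointers as Vector256<double> pointers
    // (valid in C# 8+ for unmanaged constructed types) and index by
    // vector count instead of element count.
    var vx = (Vector256<double>*)x;
    var vy = (Vector256<double>*)y;
    var vresult = Vector256<double>.Zero;

    for (int i = 0; i < n / 4; i++)
        vresult = Fma.MultiplyAdd(vx[i], vy[i], vresult);

    return vresult;
}

As kolbe notes above, this generates essentially the same inner loop, so it doesn't change the latency-bound behavior discussed in the answers below.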

3 Answers


Near duplicate of this Q&A about unrolling dot-product loops with multiple accumulators: you bottleneck on vaddpd or vfma...pd latency, not throughput, and yes, Zen 3 has lower-latency FP vaddpd (3 cycles) than FMA (4 cycles). See https://uops.info/.
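
To make the latency bound concrete (my arithmetic, based on the Zen 3 numbers above): with a single accumulator, each iteration's accumulate depends on the previous iteration's result, so the loop retires at most one vector of 4 doubles per latency period of the accumulating instruction:

    FMA version:    4 doubles / 4 cycles (vfmadd132pd latency) = 1.00 doubles per cycle
    multiply+add:   4 doubles / 3 cycles (vaddpd latency)      ≈ 1.33 doubles per cycle

Only vaddpd sits on the critical path in the second case, because each iteration's vmulpd is independent of earlier iterations. That ratio lines up with the roughly 30% slowdown the question observes for the FMA version.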

Intel Skylake / Ice Lake CPUs have 4-cycle latency for all FP add/sub/mul/fma operations, running them on the same execution units. I wouldn't have expected a speedup from FMA, since the front-end shouldn't be a bottleneck. Maybe in the add version, sometimes an independent multiply delays an add by a cycle, hurting the critical path dependency chain? Unlikely; oldest-ready-first uop scheduling should mean the independent work (multiplies) stays way ahead of the adds.

Intel Haswell, Broadwell, and Alder Lake have vaddpd latency of 3 cycles, less than their FMA latencies, so you'd see a benefit from the separate-add version there, too.

But if you unroll with multiple accumulators, you can hide FP latency and bottleneck on throughput. Or on load throughput, since you need 2 loads per FMA.
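
For illustration, a minimal sketch of that unrolling in the question's C# (the `Dot4` name is hypothetical; it assumes `n` is a multiple of 16 with no remainder handling, uses the same System.Runtime.Intrinsics usings as the original, and still leaves the horizontal sum to the caller):

public static unsafe Vector256<double> Dot4(double* x, double* y, int n)
{
    // Four independent accumulators: each FMA depends only on its own
    // accumulator, so four dependency chains can be in flight at once.
    var acc0 = Vector256<double>.Zero;
    var acc1 = Vector256<double>.Zero;
    var acc2 = Vector256<double>.Zero;
    var acc3 = Vector256<double>.Zero;

    for (int i = 0; i < n; i += 16)
    {
        acc0 = Fma.MultiplyAdd(Avx.LoadVector256(x + i), Avx.LoadVector256(y + i), acc0);
        acc1 = Fma.MultiplyAdd(Avx.LoadVector256(x + i + 4), Avx.LoadVector256(y + i + 4), acc1);
        acc2 = Fma.MultiplyAdd(Avx.LoadVector256(x + i + 8), Avx.LoadVector256(y + i + 8), acc2);
        acc3 = Fma.MultiplyAdd(Avx.LoadVector256(x + i + 12), Avx.LoadVector256(y + i + 12), acc3);
    }

    // Combine the accumulators once at the end. Note this changes the
    // rounding relative to a single serial accumulator, as Andrey
    // Semashev's comment points out.
    return Avx.Add(Avx.Add(acc0, acc1), Avx.Add(acc2, acc3));
}

Four accumulators are enough to sustain one FMA per clock against a 4-cycle latency, which is already the rate the 2-loads-per-FMA bottleneck allows on cores with 2 loads per clock.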

AMD does have the FP throughput to run 2 multiplies and 2 adds per clock on Zen 2 and later, but Intel doesn't. Although with the load bottleneck you'd only get 1 of each per clock anyway.

See also Latency bounds and throughput bounds for processors for operations that must occur in sequence regarding dependency chains and latency, https://agner.org/optimize/ for microarchitecture and asm guides, and https://uops.info/ for better instruction tables than Agner's.

Your current version takes advantage of the data parallelism in a dot product using SIMD, doing more work per instruction. But you're not doing anything to let the CPU find any instruction-level parallelism between those SIMD vector operations, so you're missing one of the three factors that can scale performance (SIMD parallelism, ILP, and thread-level parallelism for huge arrays).

– Peter Cordes

You can see instruction latencies and throughputs at https://www.agner.org/optimize/instruction_tables.pdf. vfmadd132pd is not slow on Zen 3 compared to recent Intel *Lake processors.

– jbapple

I don’t think there are bugs anywhere. The story is more interesting than that.

On modern Intel CPUs, floating-point addition, multiplication, and FMA instructions all compete for the same two execution units, P0 and P1.

On modern AMD CPUs, multiplication and FMA instructions can run on the FP0 or FP1 execution units, while floating-point addition runs on FP2 or FP3. For this reason, when throughput is measured in instructions per second instead of FLOPs, the instruction mix of 50% additions / 50% multiplications that you have in your Dot method can be twice as fast as a program of 100% FMA instructions.
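
To put rough numbers on that (my arithmetic, assuming the port layout described above): a stream of 100% FMA instructions can only occupy FP0 and FP1, for at most 2 vector operations per cycle, while a 50/50 multiply/add mix can occupy FP0/FP1 with multiplies and FP2/FP3 with adds, for up to 4 vector operations per cycle. The question's single-accumulator loops are latency-bound rather than port-bound, but the port asymmetry points in the same direction: the separate multiply and add have more headroom on Zen 3.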

– Soonts