
I have an image processing algorithm that computes the sum of a[i]*b[i] + c[i]*d[i] with AVX. The pseudocode is as follows:

#include <immintrin.h>       // AVX/FMA intrinsics
#include <opencv2/core.hpp>  // cv::getTickCount, cv::getTickFrequency

float *a = new float[N];
float *b = new float[N];
float *c = new float[N];
float *d = new float[N];

// assign values to a, b, c and d

__m256 sum = _mm256_setzero_ps(); // accumulator must start at zero
double start = cv::getTickCount();
for (int i = 0; i < N; i += 8) // assume that N is a multiple of 8
{
    __m256 am = _mm256_loadu_ps(a + i);
    __m256 bm = _mm256_loadu_ps(b + i);
    __m256 cm = _mm256_loadu_ps(c + i);
    __m256 dm = _mm256_loadu_ps(d + i);

    __m256 abm = _mm256_mul_ps(am, bm);
    __m256 cdm = _mm256_mul_ps(cm, dm);
    __m256 abcdm = _mm256_add_ps(abm, cdm);
    sum = _mm256_add_ps(sum, abcdm);
}
double time1 = (cv::getTickCount() - start) / cv::getTickFrequency();

I changed the _mm256_mul_ps and _mm256_add_ps above to _mm256_fmadd_ps as follows:

float *a = new float[N];
float *b = new float[N];
float *c = new float[N];
float *d = new float[N];

// assign values to a, b, c and d

__m256 sum = _mm256_setzero_ps();
double start = cv::getTickCount();
for (int i = 0; i < N; i += 8) // assume that N is a multiple of 8
{
    __m256 am = _mm256_loadu_ps(a + i);
    __m256 bm = _mm256_loadu_ps(b + i);
    __m256 cm = _mm256_loadu_ps(c + i);
    __m256 dm = _mm256_loadu_ps(d + i);

    sum = _mm256_fmadd_ps(am, bm, sum); // sum += a*b
    sum = _mm256_fmadd_ps(cm, dm, sum); // sum += c*d
}
double time2 = (cv::getTickCount() - start) / cv::getTickFrequency();

But the second version is slower than the first! time1 for the mul/add version is 50 ms, while time2 for the FMA version is 90 ms. Is _mm256_fmadd_ps slower than _mm256_mul_ps + _mm256_add_ps ???

I use Ubuntu 16.04 and GCC 7.5.0, with compiler flags: -fopenmp -march=native -O3

  • How are you benchmarking this? You aren't including all those calls to `new` in your timing, are you? – Andrew Henle Feb 18 '21 at 13:11
  • @AndrewHenle Yes, `new` is not included in my timing. I changed the code above. – New Coder Feb 18 '21 at 13:27
  • This is all about latency vs throughput. In the first example you have only 1 `_mm256_add_ps` which depends on the result from the previous loop iteration; in the second you have two `_mm256_fmadd_ps` which depend on each other and the previous loop iteration. (There is probably a duplicate for this ...) – chtz Feb 18 '21 at 13:30
  • @chtz: [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)](https://stackoverflow.com/q/45113527) is sort of a canonical for multiple accumulators, but it's pretty involved and not an exact dup for this (where the problem is making the dep chain longer with two FMAs into `sum`). I decided it might be worth posting a summary answer instead of just the link. – Peter Cordes Feb 18 '21 at 13:44
  • There is also this answer that tackles this problem: https://stackoverflow.com/questions/65818232/improving-performance-of-floating-point-operations-with-simd/65827668#65827668 – Andrey Semashev Feb 18 '21 at 15:00
  • @AndreySemashev: Oh yeah, that's a decent beginner example for a simple dot-product with multiple accumulators; added a link to my answer. We were I think lacking a canonical duplicate for a dot product of an array (not just a single vector). I retitled it, main downside is the code in the question is almost *too* naive (the braindead loads), but it's still pretty good. – Peter Cordes Feb 19 '21 at 02:02

1 Answer


Your reduction loops both bottleneck on latency, not throughput, because you're only using one FP vector accumulator. The FMA one is slower because you made the critical path longer (a chain of 2 instructions per loop iteration instead of just 1).

In the add version, the loop-carried dependency chain for sum is only `sum = _mm256_add_ps(sum, abcdm);`. The other instructions are independent each iteration, so that abcdm input can be ready to go before the previous vaddps has produced this iteration's sum input.

In the FMA version, the loop-carried dep chain goes through two _mm256_fmadd_ps operations, both into sum, so yes, you'd expect it to be about twice as slow.
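
To make the two critical paths concrete, here are the loop bodies annotated with per-iteration latency, assuming the Skylake-style 4-cycle add/FMA latencies discussed below (the exact numbers vary by CPU):

// mul/add version: only the final add is loop-carried
__m256 abm   = _mm256_mul_ps(am, bm);     // independent each iteration
__m256 cdm   = _mm256_mul_ps(cm, dm);     // independent each iteration
__m256 abcdm = _mm256_add_ps(abm, cdm);   // independent each iteration
sum = _mm256_add_ps(sum, abcdm);          // sum -> sum: ~4 cycles/iteration

// FMA version: both FMAs read and write sum, so they serialize
sum = _mm256_fmadd_ps(am, bm, sum);       // sum -> sum: ~4 cycles
sum = _mm256_fmadd_ps(cm, dm, sum);       // sum -> sum: another ~4, ~8 cycles/iteration

That predicts close to a 2x slowdown, roughly matching the measured 90 ms vs. 50 ms (a 1.8x ratio).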

Unroll with more accumulators to hide FP latency (like normal for a dot product). See [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)](https://stackoverflow.com/q/45113527) for much more detail about that and how OoO exec works.

Also see [Improving performance of floating-point dot-product of an array with SIMD](https://stackoverflow.com/questions/65818232/improving-performance-of-floating-point-operations-with-simd/65827668#65827668) for a much simpler beginner-friendly example of 2 accumulators.

(Adding up those separate __m256 sum0, sum1, sum2, etc. vars should be done after the loop, as in the sketch below. You can also use __m256 sum[4] to save typing. You can even use an inner loop over that array; most compilers will fully unroll small fixed-count loops, so you get the desired unrolled asm with each __m256 in a separate YMM register.)
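
For example, here is a minimal sketch of that idea applied to this loop, with 4 FMA accumulators processing 16 floats of each array per iteration (the unroll factor and load layout are illustrative choices, not the only option):

__m256 sum[4] = { _mm256_setzero_ps(), _mm256_setzero_ps(),
                  _mm256_setzero_ps(), _mm256_setzero_ps() };

for (int i = 0; i < N; i += 16)  // assume N is a multiple of 16
{
    // four independent dependency chains, one per accumulator
    sum[0] = _mm256_fmadd_ps(_mm256_loadu_ps(a+i),   _mm256_loadu_ps(b+i),   sum[0]);
    sum[1] = _mm256_fmadd_ps(_mm256_loadu_ps(c+i),   _mm256_loadu_ps(d+i),   sum[1]);
    sum[2] = _mm256_fmadd_ps(_mm256_loadu_ps(a+i+8), _mm256_loadu_ps(b+i+8), sum[2]);
    sum[3] = _mm256_fmadd_ps(_mm256_loadu_ps(c+i+8), _mm256_loadu_ps(d+i+8), sum[3]);
}

// combine the independent accumulators only after the loop
__m256 total = _mm256_add_ps(_mm256_add_ps(sum[0], sum[1]),
                             _mm256_add_ps(sum[2], sum[3]));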

Or write the plain scalar loop and let clang auto-vectorize it (with -ffast-math to allow reassociating the FP reduction); it will normally do that unrolling with multiple accumulators for you.
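
If you go that route, the scalar source is just this sketch (clang needs -ffast-math, or at least reassociation, to vectorize an FP reduction like this):

// clang -O3 -ffast-math -march=native turns this into a vectorized,
// multi-accumulator FMA loop
float total = 0.0f;
for (int i = 0; i < N; i++)
    total += a[i] * b[i] + c[i] * d[i];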

Or if you for some reason didn't want to unroll, you could use FMA while keeping the loop-carried latency low with sum += fma(a, b, c*d); (one mul, one FMA, one add). That's assuming your compiler doesn't "contract" your mul and add into an FMA for you when you compile with -ffast-math; GCC will do that aggressively across statements by default, clang won't.
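
In intrinsics, using the same am/bm/cm/dm loads as in the question, that restructured loop body looks something like this sketch; only the final add stays on the loop-carried chain:

__m256 cdm   = _mm256_mul_ps(cm, dm);        // independent work
__m256 abcdm = _mm256_fmadd_ps(am, bm, cdm); // a*b + c*d, still independent
sum = _mm256_add_ps(sum, abcdm);             // the only loop-carried op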

Once you do this, your throughput will bottleneck on 2 loads per clock (best case, even with aligned arrays for no cache-line splits, which new won't give you), so using FMA barely helps except to reduce the front-end bottleneck. (Compare a multiple-accumulator mul/add version, which needs to run 1 FP op per load to keep up.) Using multiple accumulators will let you go faster than either original loop: e.g. one iteration (4 loads) per 2 cycles, instead of 1 per 3 cycles with the vaddps latency bottleneck (on CPUs with 3-cycle vaddps).
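
If you do want aligned arrays, one option (assuming C++17 is available) is std::aligned_alloc instead of new; a minimal sketch:

#include <cstdlib>  // std::aligned_alloc, std::free

// 32-byte alignment so 256-bit loads never split a cache line.
// std::aligned_alloc requires the size to be a multiple of the
// alignment; N being a multiple of 8 guarantees that here.
float *a = static_cast<float*>(std::aligned_alloc(32, N * sizeof(float)));
// ... allocate b, c, d the same way, use _mm256_load_ps instead of
// _mm256_loadu_ps, and release with std::free(a) when done.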


On Skylake and later, FMA/add/mul all have the same latency: 4 cycles. On Haswell/Broadwell, vaddps latency is 3 cycles (one dedicated FP add unit) while FMA latency is 5.

Zen2 has 3-cycle vaddps, 5-cycle vfma...ps (https://uops.info/). (2/clock throughput for both, and on different execution ports, so you could in theory run 2 FMAs and 2 vaddps per clock on Zen2.)

With your longer-latency FMA loop being less than twice as slow, I'm guessing you might be on a Skylake-derived CPU. Perhaps the mul/add version was bottlenecking a bit on the front-end or on resource conflicts, and not quite achieving the expected 1 iteration per 4 clocks (vaddps) latency-limited speed.

In general, see https://uops.info/ for latency and uop / port breakdowns (also https://agner.org/optimize/).

  • Some CPUs (maybe not x86?) have a different latency with respect to the different arguments of FMA, they first start the multiplication, and only need the accumulator a cycle or two later. Anyway, FMA exists foremost for the extra precision, which makes it a more complicated operation than mul+add, it wouldn't be shocking for it to be more expensive, even if in practice it seldom is, except for the latency that is the heart of this question. – Marc Glisse Feb 19 '21 at 09:29
  • @MarcGlisse: Indeed, [Aarch64 what is late-forwarding?](https://stackoverflow.com/q/66212284) is an example of that. But no x86 CPUs work that way. The way they do OoO scheduling, all inputs have to be ready before sending the uop to the execution port. https://uops.info/ did actually test this, e.g. [for `vfmadd213ps` on Skylake](https://uops.info/html-lat/SKL/VFMADD213PS_XMM_XMM_XMM-Measurements.html) the latency is 4 cycles from each of the 3 inputs to the output. (The main table would show [5:4] or similar if there were multiple latencies, but no for any of Zen 1/2, HSW/SKL/ICL) – Peter Cordes Feb 19 '21 at 09:49
  • They did measure 5:4 latency for Cascade Lake: https://uops.info/html-instr/VFMADD231PS_XMM_XMM_XMM.html – amonakov Feb 19 '21 at 10:40
  • Where does this claim come from: *The way they do OoO scheduling, all inputs have to be ready before sending the uop to the execution port*? [Elsewhere](https://github.com/travisdowns/uarch-bench/issues/81) there is evidence that OoO scheduling is speculative, reissuing uops when an input ends up not ready. – amonakov Feb 19 '21 at 10:47
  • @amonakov: fair point, "expected to be ready that cycle" would be more precise. For forwarding from other ALU instructions with known latency, the cycle the result can be put on the bypass-forwarding network is known by the scheduler. (And in later cycles it can be read from the physical register.) But yes, forwarding from memory optimistically hopes for an L1d hit and may have to replay uops dependent on the load result, on Intel at least. But AFAIK, it's still true that a single uop can't start (*successfully* dispatch) if not all its inputs are ready, which is the point I intended. – Peter Cordes Feb 19 '21 at 10:53
  • @amonakov: Thanks for the Cascade Lake result; those XMM FMAs should be running on p0/p1, so any possible weirdness from the extra 512-bit FMA unit on port 5 shouldn't be coming into it. The dep-breaking `vmovupd xmm1,xmm15` seems to be copying from a register that was only ever initialized by `vzeroupper`, not written by an earlier math OP; makes me wonder about [Haswell AVX/FMA latencies tested 1 cycle slower than Intel's guide says](https://stackoverflow.com/q/64116679) (extra bypass latency "infects" a register indefinitely), tests with diff versions of the nanobench SW could have this? – Peter Cordes Feb 19 '21 at 11:00