loop unrolling not giving expected speedup for floating-point dot product

Question

 /* Inner product. Accumulate in temporary */
  void inner4(vec_ptr u, vec_ptr v, data_t *dest)
{
     long i;
     long length = vec_length(u);
     data_t *udata = get_vec_start(u);
     data_t *vdata = get_vec_start(v);
     data_t sum = (data_t) 0;

        for (i = 0; i < length; i++) {
                 sum = sum + udata[i] * vdata[i];
       }
  *dest = sum;
 }

Write a version of the inner product procedure described in the above problem that uses 6 × 1a loop unrolling . For x86-64, our measurements of the unrolled version give a CPE of 1.07 for integer data but still 3.01 for both floating-point data.

My code for 6*1a version of loop unrolling

 void inner4(vec_ptr u, vec_ptr v, data_t *dest){
       long i;
       long length = vec_length(u);
       data_t *udata = get_vec_start(u);
       data_t *vdata = get_vec_start(v);
       long limit = length -5;
       data_t sum = (data_t) 0;

      for(i=0; i<limit; i+=6){
             sum = sum +
                   ((udata[ i ] * vdata[ i ]
                  + udata[ i+1 ] * vdata[ i+1 ])
                  + (udata[ i+2 ] * vdata[ i+2 ]
                  + udata[ i+3 ] * vdata[ i+3 ]))
                   + ((udata[ i+4 ] * vdata[ i+4 ])
                  + udata[ i+5 ] * vdata[ i+5 ]);
      }
     for (i = 0; i < length; i++) {
             sum = sum + udata[i] * vdata[i];
   }
  *dest = sum;
      
 }

Question: Explain why any (scalar) version of an inner product procedure running on an Intel Core i7 Haswell processor cannot achieve a CPE less than 1.00.

Any idea how to solve the problem?

Not sure what your "6 × 1a" terminology means, but the standard way is just to use `sum0`, `sum1`, etc. or a `sum[6]` array, and unroll by 6. Add at the end. See [AVX performance slower for bitwise xor op and popcount](https://stackoverflow.com/posts/comments/125940132) for another example of loop unrolling, and [Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell](https://stackoverflow.com/q/21090873) for another. (Those are also using SIMD; IDK if there's a canonical Q&A just using scalar unrolling) — Peter Cordes, Feb 26 '22 at 05:36
*Explain why any (scalar) version of an inner product procedure running on an Intel Core i7 Haswell processor cannot achieve a CPE less than 1.00.* - each element takes 2 loads, and with only 2 load ports, that's a hard throughput bottleneck. Assuming you're counting an "element" as an `i` value, a pair of floats, one each from `udata[i]` and `vdata[i]`. The FP FMA throughput bottleneck is half that; Haswell is capable of two scalar FMAs per clock, but dot product takes 2 load per FMA. — Peter Cordes, Mar 03 '22 at 05:29
Re: your code not helping with the bottleneck: `sum + x + y + z` without `-ffast-math` is the same order of operations as `sum += x;` `sum += y;` ... so you haven't changed anything about the FP latency bottleneck. What would work is `sum += ...` as one way to unroll without multiple accumulators, because then the sum of each group of elements is independent. (It costs slightly more math operations that way, like starting with a mul and ending with an add, instead of contracting everything into FMAs. But that's not the bottleneck. — Peter Cordes, Mar 03 '22 at 05:35
See [AVX performance slower for bitwise xor op and popcount](https://stackoverflow.com/posts/comments/125944160) for an example of GCC doing that for you with `-ffast-math` with a manual unroll that failed to use multiple accumulators. — Peter Cordes, Mar 03 '22 at 05:37

score 2 · Accepted Answer · answered Mar 03 '22 at 06:16

Your unroll doesn't help with the FP latency bottleneck:

sum + x + y + z without -ffast-math is the same order of operations as sum += x; sum += y; ... so you haven't done anything about the single dependency chain running through all the + operations. Loop overhead (or front-end throughput) is not the bottleneck, it's the 3 cycle latency of addss on Haswell, so this unroll makes basically no difference.

What would work is sum += u[i]*v[i] + u[i+1]*v[i+1] + ... as a way to unroll without multiple accumulators, because then the sum of each group of elements is independent.

It costs slightly more math operations that way, like starting with a mul and ending with an add, but the middle ones can still contract into FMAs if you compile with -march=haswell. See comments on AVX performance slower for bitwise xor op and popcount for an example of GCC turning a naive unroll like sum += u0*v0; sum += u1*v1 into sum += u0*v0 + u1*v1;. In that case the problem was slightly different: sum of squared differences like sum += (u0-v0)**2 + (u1-v1)**2;, but it boils down to the same latency problem of ultimately doing some multiplies and adds.

The other way to solve the problem is with multiple accumulators, allowing all the operations to be FMAs. But Haswell has 5-cycle latency FMA, and 3-cycle latency addss, so doing the sum += ... addition on its own, not as part of an FMA, actually helps with the latency bottleneck on Haswell (unlike on Skylake add/sub/mul are all 4 cycle latency). The following all show unrolling with multiple accumulators, instead of with adding groups together like the first towards pairwise summation like you're doing:

FP math instruction throughput isn't the bottleneck for a big dot product on modern CPUs, only latency. Or load throughput if you unroll enough.

Explain why any (scalar) version of an inner product procedure running on an Intel Core i7 Haswell processor cannot achieve a CPE less than 1.00.

Each element takes 2 loads, and with only 2 load ports, that's a hard throughput bottleneck. (https://agner.org/optimize/ / https://www.realworldtech.com/haswell-cpu/5/)

I'm assuming you're counting an "element" as an i value, a pair of floats, one each from udata[i] and vdata[i]. The FP FMA throughput bottleneck is also 2/clock on Haswell (whether they're scalar, 128-bit, or 256-bit vectors), but dot product takes 2 loads per FMA. In theory, even Sandybridge or maybe even K8 could achieve 1 element per clock, with separate mul and add instructions, since they both support 2 loads per clock, and have a wide enough pipeline to get load / mulss / addss through the pipeline with some room to spare.

loop unrolling not giving expected speedup for floating-point dot product

1 Answers1

Linked