See @harold's answer for an actual improvement. This is mostly to repost what I wrote in comments.
From the question:

> four instead of two sums in the inner loop. (Why doesn't unrolling help?)
There's no loop-carried dependency through `sum[i]`. The next iteration assigns `sum[0] = _mm256_load_pd(C+i1*SIZE_K+j1+0);`, which has no dependency on the previous value.
Therefore register-renaming of the same architectural register onto different physical registers is sufficient to avoid write-after-write hazards that might stall the pipeline; there's no need to complicate the source with multiple tmp variables. See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators). (In that question, a dot product of two arrays, there *is* a loop-carried dependency, through an accumulator. There, using multiple accumulators is valuable to hide FP FMA latency so we bottleneck on FMA throughput, not latency.)
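To make that concrete, here's a minimal sketch in the shape of the code being discussed. Only the `sum[0] = _mm256_load_pd(C+i1*SIZE_K+j1+0);` line is from the question; the function signature, `a_bcast`, and the loop bounds are assumptions for illustration:

```c
#include <immintrin.h>

// Hypothetical inner loop in the shape being discussed: every iteration
// starts by re-loading sum[i] fresh from C, so the previous iteration's
// value of that architectural register is dead. Renaming gives each
// iteration fresh physical registers; no loop-carried dependency exists.
void inner(double *C, const double *B, double a, long i1, long k1,
           long SIZE_K, long SIZE_N) {
    __m256d a_bcast = _mm256_set1_pd(a);       // broadcast the scalar from A
    for (long j1 = 0; j1 < SIZE_N; j1 += 8) {
        __m256d sum[2];
        sum[0] = _mm256_load_pd(C + i1*SIZE_K + j1 + 0);  // no dep on last iter
        sum[1] = _mm256_load_pd(C + i1*SIZE_K + j1 + 4);
        // The B loads can fold into memory-source FMAs, costing no extra reg.
        sum[0] = _mm256_fmadd_pd(a_bcast, _mm256_load_pd(B + k1*SIZE_N + j1 + 0), sum[0]);
        sum[1] = _mm256_fmadd_pd(a_bcast, _mm256_load_pd(B + k1*SIZE_N + j1 + 4), sum[1]);
        _mm256_store_pd(C + i1*SIZE_K + j1 + 0, sum[0]);  // sum[0] dead after this
        _mm256_store_pd(C + i1*SIZE_K + j1 + 4, sum[1]);
    }
}
```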
A pipeline without register renaming (most in-order CPUs) would benefit from "software pipelining": statically scheduling what out-of-order exec does on the fly, loading into different registers so there's distance (filled with independent work) between each load and the FMA that consumes it, and then between that and the store.
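A hand-rolled sketch of that idea (hypothetical, for a machine where you must schedule statically; only the `b` stream is pipelined here, for brevity):

```c
#include <immintrin.h>

// Hypothetical software-pipelined loop: the load for iteration i+1 is
// issued before the FMA for iteration i, so on an in-order machine the
// load latency is overlapped with independent work instead of stalling
// the FMA that consumes it. Assumes n is a multiple of 4 and n >= 4.
void fma_pipelined(double *restrict c, const double *restrict b,
                   __m256d a_bcast, long n) {
    __m256d b_cur = _mm256_load_pd(b);              // prologue: first load
    long i;
    for (i = 0; i < n - 4; i += 4) {
        __m256d b_next = _mm256_load_pd(b + i + 4); // load for NEXT iteration
        __m256d sum = _mm256_fmadd_pd(a_bcast, b_cur,  // consume PREVIOUS load
                                      _mm256_load_pd(c + i));
        _mm256_store_pd(c + i, sum);
        b_cur = b_next;
    }
    // epilogue: final FMA + store for the last vector
    __m256d sum = _mm256_fmadd_pd(a_bcast, b_cur, _mm256_load_pd(c + i));
    _mm256_store_pd(c + i, sum);
}
```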
But all modern x86 CPUs are OoO; even Knights Landing has some limited OoO exec for SIMD vectors. (Silvermont doesn't support AVX, but does run SIMD instructions in-order, only doing OoO exec for integer.)
Without any multiple-accumulator situation to hide latency, the benefits of unrolling (explicitly in the source, or with `-funroll-loops` as enabled by `-fprofile-use`, or in clang by default) are:
- Reduce the front-end bandwidth spent getting loop overhead into the back-end: more useful-work uops per uop of loop overhead. So it helps if your "useful work" is close to bottlenecked on the front end.
- Less back-end execution-unit demand from running the loop overhead. Normally not a problem on Haswell and later, or Zen; the back end can mostly keep up with the front end when the instruction mix includes some integer stuff and some pure load instructions.
- Fewer total uops per work done means OoO exec can "see" farther ahead for memory loads/stores.
- Sometimes better branch prediction for short-running loops: the lower iteration count means a shorter pattern for branch prediction to learn. So for short trip-counts, there's a better chance of correctly predicting the branch as not-taken on the last iteration, when execution falls out of the loop.
- Sometimes save a `mov reg,reg` in more complicated cases where it's easier for the compiler to generate a new result in a different reg. The same variable can alternate between living in two regs instead of needing to get `mov`ed back to the same one to be ready for the next iteration. Especially if you have a loop that uses `a[i]` and `a[i+1]` in a dependent way, or something like Fibonacci (see the sketch after this list).
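For the `mov`-saving point, a hypothetical Fibonacci-style example: unrolled by 2, the two live values just swap roles between iterations, so nothing needs to be `mov`ed back into a canonical register:

```c
// Hypothetical example: the rolled loop needs tmp = a+b; a = b; b = tmp;
// i.e. register shuffling every iteration. Unrolled by 2, a and b simply
// alternate roles, compiling to two add instructions and zero movs.
// Returns fib(2*(n/2) + 1); odd-n cleanup omitted for brevity.
unsigned fib(unsigned n) {
    unsigned a = 0, b = 1;
    for (unsigned i = 0; i < n / 2; i++) {
        a += b;   // a holds the newer value for this half-step
        b += a;   // b holds the newer value for this half-step
    }
    return b;
}
```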
With 2 loads + 1 store in the loop, that will probably be the bottleneck, not FMA or front-end bandwidth. Unrolling by 2 might have helped avoid a front-end bottleneck, but more than that would only matter with contention from another hyperthread.
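To put rough numbers on that (assuming a Haswell/Skylake-like core that sustains 2 loads + 1 store per clock, with 2 FMA pipes and a 4-wide front end): each vector of C needs a load of C, a load of B, one FMA, and a store. The 2 loads + 1 store already saturate the memory pipes at one vector per clock, while one FMA per clock is only half of FMA throughput, and that's only a handful of fused-domain uops per vector, leaving front-end headroom once unrolled a bit. So the memory ops are the limiting factor, as stated.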
An interesting question came up in comments: doesn't unrolling need a lot of registers to be useful?
Harold commented:
> 16 is not a huge number of registers, but it's enough to have 12 accumulators and 3 pieces of row vector from B and the broadcasted scalar from A, so it works out to just about enough. The loop from OP above barely uses any registers anyway. The 8 registers in 32bit are indeed too few.
Of course, since the code in the question doesn't have "accumulators" in registers across loop iterations, only adding into memory, compilers could have optimized all of `sum[0..n]` to reuse the same register in asm; it's "dead" after storing. So actual register pressure is very low.
Yes, x86-64 is somewhat register-poor; that's why AVX512 doubles the number as well as the width of vector regs (zmm0..31). And yes, many RISCs have 32 integer / 32 FP regs, including AArch64, up from 16 in 32-bit ARM.
x86-64 has 16 scalar integer registers (including the stack pointer, not including the program counter), so normal functions can use 15. There are also 16 vector regs, xmm0..15. (And with AVX they're double the width: ymm0..15.)
(Some of this was written before I noticed that `sum[0..n]` was pointless, not loop-carried.)
Register renaming onto a large physical register file is sufficient in this case. There are other cases where having more architectural registers helps, especially with higher FP latency, hence why AVX512 has 32 zmm regs. But for integer, 16 is close to enough. RISC CPUs were often designed for in-order execution without register renaming, needing software pipelining.
With OoO exec, the jump from 8 to 16 architectural GP integer regs is more significant than a jump from 16 to 32 would be, in terms of reducing spill/reloads. (I've seen a paper that measured total dynamic instruction count for SPECint with various numbers of architectural registers. I didn't look it up again, but 8->16 might have been a 10% total saving while 16->32 was only a couple percent; something like that.)
But this specific problem doesn't need a lot of FP registers: only 4 vectors for `sum[0..3]` (if they were loop-carried) and maybe 1 temporary; x86 can use memory-source mul/add/FMA. Register renaming removes any WAW hazards so we can reuse the same temporary register instead of needing software pipelining. (And OoO exec also hides load and ALU latency.)
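As a tiny illustration of the memory-source point (hypothetical helper, not from the question):

```c
#include <immintrin.h>

// With AVX + FMA3, the compiler can fold this load into the FMA as a
// memory-source operand (vfmadd231pd ymm, ymm, [mem]), so no extra
// architectural register is tied up holding the B element.
__m256d fma_step(__m256d sum, __m256d a_bcast, const double *Bp) {
    return _mm256_fmadd_pd(a_bcast, _mm256_load_pd(Bp), sum);
}
```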
You want multiple accumulators when there *are* loop-carried dependencies (see the sketch below). This code is adding into memory, not into a few vector accumulators, so any dependency is through store/reload. But that only has ~7-cycle latency, so any sane cache-blocking factor hides it.
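For contrast, here's a minimal sketch of the situation from the linked mulss Q&A, where multiple accumulators do matter because each FMA's output feeds the next FMA. Everything here (names, unroll factor) is illustrative:

```c
#include <immintrin.h>
#include <stddef.h>

// Hypothetical dot product with 4 accumulators: 4 independent dependency
// chains keep the FMA units busy instead of serializing on FMA latency
// (4-5 cycles on Haswell/Skylake). A single accumulator would limit
// throughput to one FMA per latency, not one per cycle per pipe.
// Cleanup loop for n % 16 omitted for brevity.
double dot(const double *a, const double *b, size_t n) {
    __m256d acc0 = _mm256_setzero_pd(), acc1 = _mm256_setzero_pd();
    __m256d acc2 = _mm256_setzero_pd(), acc3 = _mm256_setzero_pd();
    for (size_t i = 0; i + 16 <= n; i += 16) {
        acc0 = _mm256_fmadd_pd(_mm256_loadu_pd(a+i+ 0), _mm256_loadu_pd(b+i+ 0), acc0);
        acc1 = _mm256_fmadd_pd(_mm256_loadu_pd(a+i+ 4), _mm256_loadu_pd(b+i+ 4), acc1);
        acc2 = _mm256_fmadd_pd(_mm256_loadu_pd(a+i+ 8), _mm256_loadu_pd(b+i+ 8), acc2);
        acc3 = _mm256_fmadd_pd(_mm256_loadu_pd(a+i+12), _mm256_loadu_pd(b+i+12), acc3);
    }
    // Combine the chains only once, outside the loop.
    __m256d acc = _mm256_add_pd(_mm256_add_pd(acc0, acc1),
                                _mm256_add_pd(acc2, acc3));
    // Horizontal sum of the 4 doubles in acc.
    __m128d lo = _mm256_castpd256_pd128(acc);
    __m128d hi = _mm256_extractf128_pd(acc, 1);
    lo = _mm_add_pd(lo, hi);
    __m128d shuf = _mm_unpackhi_pd(lo, lo);
    return _mm_cvtsd_f64(_mm_add_sd(lo, shuf));
}
```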