
I am trying to optimize my matrix multiplication code running on a single core. How can I further improve the performance with regard to loop unrolling and FMA/SSE? I'm also curious to know why the performance doesn't increase when I use four sums instead of two in the inner loop.

The problem size is a 1000x1000 matrix multiplication. Both gcc 9 and icc 19.0.5 are available. Intel Xeon @ 3.10GHz, 32K L1d Cache, Skylake Architecture. Compiled with gcc -O3 -mavx.

void mmult(double* A, double* B, double* C)
{
    const int block_size = 64 / sizeof(double);

    __m256d sum[2], broadcast;

    for (int i0 = 0; i0 < SIZE_M; i0 += block_size) {
    for (int k0 = 0; k0 < SIZE_N; k0 += block_size) {
    for (int j0 = 0; j0 < SIZE_K; j0 += block_size) {

        int imax = i0 + block_size > SIZE_M ? SIZE_M : i0 + block_size;
        int kmax = k0 + block_size > SIZE_N ? SIZE_N : k0 + block_size;
        int jmax = j0 + block_size > SIZE_K ? SIZE_K : j0 + block_size;

        for (int i1 = i0; i1 < imax; i1++) {

            for (int k1 = k0; k1 < kmax; k1++) {

                broadcast = _mm256_broadcast_sd(A+i1*SIZE_N+k1);
                for (int j1 = j0; j1 < jmax; j1+=8) {
                    sum[0] = _mm256_load_pd(C+i1*SIZE_K+j1+0);
                    sum[0] = _mm256_add_pd(sum[0], _mm256_mul_pd(broadcast, _mm256_load_pd(B+k1*SIZE_K+j1+0)));
                    _mm256_store_pd(C+i1*SIZE_K+j1+0, sum[0]);

                    sum[1] = _mm256_load_pd(C+i1*SIZE_K+j1+4);
                    sum[1] = _mm256_add_pd(sum[1], _mm256_mul_pd(broadcast, _mm256_load_pd(B+k1*SIZE_K+j1+4)));
                    _mm256_store_pd(C+i1*SIZE_K+j1+4, sum[1]);

                    // doesn't improve performance.. why?
                    // sum[2] = _mm256_load_pd(C+i1*SIZE_K+j1+8);
                    // sum[2] = _mm256_add_pd(sum[2], _mm256_mul_pd(broadcast, _mm256_load_pd(B+k1*SIZE_K+j1+8)));
                    // _mm256_store_pd(C+i1*SIZE_K+j1+8, sum[2]);

                    // sum[3] = _mm256_load_pd(C+i1*SIZE_K+j1+12);
                    // sum[3] = _mm256_add_pd(sum[3], _mm256_mul_pd(broadcast, _mm256_load_pd(B+k1*SIZE_K+j1+12)));
                    // _mm256_store_pd(C+i1*SIZE_K+j1+12, sum[3]);
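                    // (note: enabling sum[2]/sum[3] would also require stepping j1 by 16
                    //  instead of 8, otherwise the next iteration adds into these columns again)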
                }
            }
        }
    }
    }
    }
}
  • In general, loop unrolling only improves performance if your unrolled loop has unused registers. If you go too far, your unrolled loops will start register spills/fills and slow processing down. Note that the x86 architecture is already register-poor, so loop unrolling often doesn't work well on x86 systems. But to really know why, you'd have to do detailed profiling on both versions of your code. Given your posted code has six levels of nested loops, you might actually get better performance from splitting your inner loop(s). – Andrew Henle Nov 26 '19 at 12:55
  • @AndrewHenle: x86-64 has 16 YMM registers; this would *not* be a problem. Not even in 32-bit code with 8x YMM registers. More likely you see no benefit because load+store is the bottleneck, not front-end bandwidth or back-end ALU ports. – Peter Cordes Nov 26 '19 at 13:54
  • What CPU, what compiler options + version? **What problem sizes.** Are we talking about matrices that fit in L1d or L2 cache? Or L3 or main memory? – Peter Cordes Nov 26 '19 at 13:56
  • You may like to post a complete version of your benchmark. – Maxim Egorushkin Nov 26 '19 at 14:12
  • @PeterCordes The problem size is a 1000x1000 matrix multiplication. Both gcc 9 and icc 19.0.5 are available. Intel Xeon @ 3.10GHz, 32K L1d Cache. –  Nov 26 '19 at 16:46
  • So put that in your question, along with the actual options used. GCC defaults to `-O0`. Also, the only cache parameter that's actually variable is L3 cache size. (And per-core private L2 cache size for "Scalable Xeon (1MiB L2)" vs. Broadwell or older Xeon (256kiB L2)). – Peter Cordes Nov 26 '19 at 19:06
  • @PeterCordes Done –  Nov 26 '19 at 23:21
  • Oh, you didn't use `-march=native` or at least `-mtune=native`? You did use `_mm256_load_pd`, not `loadu`, so the "generic" tuning won't split your 256-bit loads/stores into two halves. That would have been a bottleneck. I don't think there are any other obvious problems from not enabling other related ISA extensions or tuning for whatever Xeon you actually have. (You still haven't specified what microarchitecture. "Xeon" covers everything from Pentium 4 to Cascade Lake. AVX limits it to Sandybridge through Cascade Lake, but that's still a big diff in L1d load throughput.) – Peter Cordes Nov 26 '19 at 23:27
  • Transpose B before multiplication and multiply row by row rather than row by column? Just a wild guess. I don't understand avx code :( – n. m. could be an AI Nov 26 '19 at 23:33
  • @n.'pronouns'm. Transposing is done "implicitly" by moving the k-loop one step further up I think. So it should be already row by row. –  Nov 26 '19 at 23:36
  • @PeterCordes It's the Skylake architecture. So using loadu would be better? What about the sums in the inner loop? And what about FMA? –  Nov 26 '19 at 23:41
  • That's the opposite of what I said. I said that if you had used `loadu`, the tune=generic that you used would be a problem. GCC defaults to `-ffp-contract=fast` so you get FMA from this source if you compile this with `-march=native` or `-march=skylake` (client, not `skylake-avx512`). – Peter Cordes Nov 26 '19 at 23:50
  • I would have thought that using 3 or 4 accumulators would help to hide FP latency, but you haven't shown your `block_size` so we don't know how you tuned the cache blocking. – Peter Cordes Nov 26 '19 at 23:51
  • @PeterCordes My block_size is 32. –  Nov 27 '19 at 00:05
  • @PeterCordes *x86-64 has 16 YMM registers; this would not be a problem. Not even in 32-bit code with 8x YMM registers.* I truly respect your opinions and your posts, but that's half the number of 64-bit VIS registers on an UltraSPARC III CPU from 20 years ago. I'm not sure how many registers a SPARC M8 has, but I doubt those numbers have gone down. [The IBM POWER9 register spec](https://wiki.raptorcs.com/w/images/9/95/POWER9_Registers_vol3_version1.2_pub.pdf) has 16 **pages** of line-after-line of nothing but register mnemonics. I didn't bother counting them. I say x86 is "register poor". – Andrew Henle Nov 27 '19 at 00:14
  • 16 is not a huge number of registers, but it's enough to have 12 accumulators and 3 pieces of row vector from `B` and the broadcasted scalar from `A`, so it works out to just about enough. The loop from OP above barely uses any registers anyway. The 8 registers in 32bit are indeed too few. – harold Nov 27 '19 at 00:18
  • @AndrewHenle: Sure, good question, and yes x86-64 is somewhat register-poor, that's why AVX512 doubles the number as well as width of vector regs. Yes, many RISCs have 32 int / 32 fp regs, including AArch64 up from 16 in ARM. But this specific problem doesn't need a lot of FP registers, only 4 vector accumulators for `sum[0..3]` and maybe 1 temporary; x86 can use memory-source mul/add/FMA. Register renaming removes any WAW hazards so we can reuse the same temporary register instead of needing software pipelining. (And OoO exec also hides load and ALU latency.) – Peter Cordes Nov 27 '19 at 00:28
  • @AndrewHenle: In fact now that I look at it, `sum[i]` are not loop-carried and are 100% pointless vs. a single temporary. See my answer on [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)](//stackoverflow.com/q/45113527) for more, for a case where there *are* loop-carried dependencies and you do want multiple accumulators. This code is adding into memory, not into a few vector accumulators, so any dependency is through store/reload. But that only has ~7 cycle latency so any sane cache-blocking factor hides it. – Peter Cordes Nov 27 '19 at 00:30
  • @AndrewHenle: TL:DR: for a problem without actual loop-carried dependencies, unrolling (explicitly in the source or with `-funroll-loops` as enabled by `-fprofile-use`, or in clang by default) just helps with back-end integer loop overhead + front-end bandwidth. Register renaming onto a large physical register file is sufficient in this case. There are other cases where having more *architectural* registers helps, especially for higher FP latency, hence AVX512's 32 zmm regs. But for integer, 16 is close to enough. RISC CPUs were often designed for in-order without reg renaming, needing SW pipelining. – Peter Cordes Nov 27 '19 at 00:36

2 Answers


This code has 2 loads per FMA (if FMA contraction happens), but Skylake only supports at most one load per FMA in theory (if you want to max out the 2/clock FMA throughput), and even that is usually too much in practice. (Peak throughput is 2 loads + 1 store per clock, but it usually can't quite sustain that.) See Intel's optimization guide and https://agner.org/optimize/

The loop overhead is not the biggest problem; the body itself forces the code to run at half speed.
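To put rough numbers on that, here is a per-iteration count for the j1 loop in the question, assuming FMA contraction (the 2 loads, 1 store and 2 FMAs per clock are Skylake's peak rates, as above):

// per j1 iteration (8 doubles of C updated):
//   4 vector loads  (2 from C, 2 from B)  ->  at 2 loads/clock : >= 2 cycles
//   2 vector stores (back into C)         ->  at 1 store/clock : >= 2 cycles
//   2 FMAs                                ->  at 2 FMAs/clock  : >= 1 cycle
// The loads and stores alone cost about 2 cycles per iteration, during which the
// FMA units could have executed 4 FMAs instead of 2 -- roughly half of peak FMA
// throughput, no matter how far the j1 loop is unrolled.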

If the k-loop were the inner loop, a lot of accumulation could be chained without having to load/store to and from C. This has a downside: with a loop-carried dependency chain like that, it would be up to the code to explicitly ensure that there is enough independent work to be done.

In order to have few loads but enough independent work, the body of the inner loop could calculate the product between a small column vector from A and a small row vector from B, for example using 4 scalar broadcasts to load the column and 2 normal vector loads from B, resulting in just 6 loads for 8 independent FMAs (even lower ratios are possible), which is enough independent FMAs to keep Skylake happy and not too many loads. Even a 3x4 footprint is possible, which also has enough independent FMAs to keep Haswell happy (it needs at least 10).

I happen to have some example code; it's for single precision and C++, but you'll get the point:

sumA_1 = _mm256_load_ps(&result[i * N + j]);
sumB_1 = _mm256_load_ps(&result[i * N + j + 8]);
sumA_2 = _mm256_load_ps(&result[(i + 1) * N + j]);
sumB_2 = _mm256_load_ps(&result[(i + 1) * N + j + 8]);
sumA_3 = _mm256_load_ps(&result[(i + 2) * N + j]);
sumB_3 = _mm256_load_ps(&result[(i + 2) * N + j + 8]);
sumA_4 = _mm256_load_ps(&result[(i + 3) * N + j]);
sumB_4 = _mm256_load_ps(&result[(i + 3) * N + j + 8]);

for (size_t k = kk; k < kk + akb; k++) {
    auto bc_mat1_1 = _mm256_set1_ps(*mat1ptr);
    auto vecA_mat2 = _mm256_load_ps(mat2 + m2idx);
    auto vecB_mat2 = _mm256_load_ps(mat2 + m2idx + 8);
    sumA_1 = _mm256_fmadd_ps(bc_mat1_1, vecA_mat2, sumA_1);
    sumB_1 = _mm256_fmadd_ps(bc_mat1_1, vecB_mat2, sumB_1);
    auto bc_mat1_2 = _mm256_set1_ps(mat1ptr[N]);
    sumA_2 = _mm256_fmadd_ps(bc_mat1_2, vecA_mat2, sumA_2);
    sumB_2 = _mm256_fmadd_ps(bc_mat1_2, vecB_mat2, sumB_2);
    auto bc_mat1_3 = _mm256_set1_ps(mat1ptr[N * 2]);
    sumA_3 = _mm256_fmadd_ps(bc_mat1_3, vecA_mat2, sumA_3);
    sumB_3 = _mm256_fmadd_ps(bc_mat1_3, vecB_mat2, sumB_3);
    auto bc_mat1_4 = _mm256_set1_ps(mat1ptr[N * 3]);
    sumA_4 = _mm256_fmadd_ps(bc_mat1_4, vecA_mat2, sumA_4);
    sumB_4 = _mm256_fmadd_ps(bc_mat1_4, vecB_mat2, sumB_4);
    m2idx += 16;
    mat1ptr++;
}
_mm256_store_ps(&result[i * N + j], sumA_1);
_mm256_store_ps(&result[i * N + j + 8], sumB_1);
_mm256_store_ps(&result[(i + 1) * N + j], sumA_2);
_mm256_store_ps(&result[(i + 1) * N + j + 8], sumB_2);
_mm256_store_ps(&result[(i + 2) * N + j], sumA_3);
_mm256_store_ps(&result[(i + 2) * N + j + 8], sumB_3);
_mm256_store_ps(&result[(i + 3) * N + j], sumA_4);
_mm256_store_ps(&result[(i + 3) * N + j + 8], sumB_4);

This means that the j-loop and the i-loop are unrolled, but not the k-loop, even though it is the inner loop now. Unrolling the k-loop did help a bit in my experiments.
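As a rough, untested sketch, here is how the same idea (4 rows x 2 vectors of C kept in registers across the whole k-loop) could look for the double-precision code in the question. It reuses the question's A/B/C and SIZE_M/SIZE_N/SIZE_K names and the same 32-byte alignment its loads already assume; it further assumes SIZE_M is a multiple of 4 and SIZE_K a multiple of 8, needs FMA enabled (-mfma or -march=skylake), and leaves out the cache blocking for brevity:

for (int i = 0; i < SIZE_M; i += 4) {
    for (int j = 0; j < SIZE_K; j += 8) {
        // 8 accumulators: 4 rows of C, 2 vectors (8 doubles) per row
        __m256d c00 = _mm256_load_pd(C + (i+0)*SIZE_K + j);
        __m256d c01 = _mm256_load_pd(C + (i+0)*SIZE_K + j + 4);
        __m256d c10 = _mm256_load_pd(C + (i+1)*SIZE_K + j);
        __m256d c11 = _mm256_load_pd(C + (i+1)*SIZE_K + j + 4);
        __m256d c20 = _mm256_load_pd(C + (i+2)*SIZE_K + j);
        __m256d c21 = _mm256_load_pd(C + (i+2)*SIZE_K + j + 4);
        __m256d c30 = _mm256_load_pd(C + (i+3)*SIZE_K + j);
        __m256d c31 = _mm256_load_pd(C + (i+3)*SIZE_K + j + 4);
        for (int k = 0; k < SIZE_N; k++) {
            __m256d b0 = _mm256_load_pd(B + k*SIZE_K + j);        // 2 vector loads from B
            __m256d b1 = _mm256_load_pd(B + k*SIZE_K + j + 4);
            __m256d a0 = _mm256_broadcast_sd(A + (i+0)*SIZE_N + k);  // 4 scalar broadcasts from A
            c00 = _mm256_fmadd_pd(a0, b0, c00);
            c01 = _mm256_fmadd_pd(a0, b1, c01);
            __m256d a1 = _mm256_broadcast_sd(A + (i+1)*SIZE_N + k);
            c10 = _mm256_fmadd_pd(a1, b0, c10);
            c11 = _mm256_fmadd_pd(a1, b1, c11);
            __m256d a2 = _mm256_broadcast_sd(A + (i+2)*SIZE_N + k);
            c20 = _mm256_fmadd_pd(a2, b0, c20);
            c21 = _mm256_fmadd_pd(a2, b1, c21);
            __m256d a3 = _mm256_broadcast_sd(A + (i+3)*SIZE_N + k);
            c30 = _mm256_fmadd_pd(a3, b0, c30);
            c31 = _mm256_fmadd_pd(a3, b1, c31);
        }
        _mm256_store_pd(C + (i+0)*SIZE_K + j,     c00);
        _mm256_store_pd(C + (i+0)*SIZE_K + j + 4, c01);
        _mm256_store_pd(C + (i+1)*SIZE_K + j,     c10);
        _mm256_store_pd(C + (i+1)*SIZE_K + j + 4, c11);
        _mm256_store_pd(C + (i+2)*SIZE_K + j,     c20);
        _mm256_store_pd(C + (i+2)*SIZE_K + j + 4, c21);
        _mm256_store_pd(C + (i+3)*SIZE_K + j,     c30);
        _mm256_store_pd(C + (i+3)*SIZE_K + j + 4, c31);
    }
}

Each k iteration now does 6 loads (4 broadcasts from A + 2 vector loads from B) for 8 independent FMAs, matching the ratio described above, and C is only loaded and stored once per 4x8 tile instead of once per k iteration.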

harold

See @harold's answer for an actual improvement. This is mostly to repost what I wrote in comments.

Regarding using four instead of two sums in the inner loop (i.e., why doesn't unrolling help?):

There's no loop-carried dependency through sum[i]. The next iteration assigns sum[0] = _mm256_load_pd(C+i1*SIZE_K+j1+0); which has no dependency on the previous value.

Therefore register-renaming of the same architectural register onto different physical registers is sufficient to avoid write-after-write hazards that might stall the pipeline. No need to complicate the source with multiple tmp variables. See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators). (In that question, a dot product of two arrays, there *is* a loop-carried dependency, through an accumulator; there, using multiple accumulators is valuable to hide FP FMA latency so we bottleneck on FMA throughput, not latency.)
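For example, a minimal untested sketch of the question's inner loop with a single temporary; the _mm256_fmadd_pd here assumes FMA is enabled (-mfma or -march=skylake), otherwise keep the separate mul+add:

for (int j1 = j0; j1 < jmax; j1 += 8) {
    __m256d t;   // one temporary is enough; register renaming avoids any WAW stalls
    t = _mm256_load_pd(C + i1*SIZE_K + j1);
    t = _mm256_fmadd_pd(broadcast, _mm256_load_pd(B + k1*SIZE_K + j1), t);
    _mm256_store_pd(C + i1*SIZE_K + j1, t);

    t = _mm256_load_pd(C + i1*SIZE_K + j1 + 4);
    t = _mm256_fmadd_pd(broadcast, _mm256_load_pd(B + k1*SIZE_K + j1 + 4), t);
    _mm256_store_pd(C + i1*SIZE_K + j1 + 4, t);
}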

A pipeline without register renaming (most in-order CPUs) would benefit from "software pipelining" to statically schedule for what out-of-order exec can do on the fly: load into different registers so there's distance (filled with independent work) between each load and the FMA that consumes it. And then between that and the store.

But all modern x86 CPUs are OoO; even Knight's Landing has some limited OoO exec for SIMD vectors. (Silvermont doesn't support AVX, but does run SIMD instructions in-order, only doing OoO for integer).


Without any multiple-accumulator situation to hide latency, the benefits of unrolling (explicitly in the source, or with -funroll-loops as enabled by -fprofile-use, or in clang by default) are:

  • Reduces the front-end bandwidth needed to issue the loop overhead into the back end: more useful-work uops per uop of loop overhead. This helps if your "useful work" is close to bottlenecked on the front end.
  • Less back-end execution-unit demand for running the loop overhead. Normally not a problem on Haswell and later, or Zen; the back end can mostly keep up with the front-end when the instruction mix includes some integer stuff and some pure load instructions.
  • Fewer total uops per work done means OoO exec can "see" farther ahead for memory loads/stores.
  • Sometimes better branch prediction for short-running loops: the lower iteration count means a shorter pattern for the branch predictor to learn, so for short trip-counts there's a better chance of correctly predicting the final not-taken iteration when execution falls out of the loop.
  • Sometimes save a mov reg,reg in more complicated cases where it's easier for the compiler to generate a new result in a different reg. The same variable can alternate between living in two regs instead of needing to get moved back to the same one to be ready for the next iteration. Especially if you have a loop that uses a[i] and a[i+1] in a dependent way, or something like Fibonacci.

With 2 loads + 1 store per FMA in this loop, memory access will probably be the bottleneck, not FMA or front-end bandwidth. Unrolling by 2 might have helped avoid a front-end bottleneck, but more than that would only matter with contention from another hyperthread.


An interesting question came up in comments: doesn't unrolling need a lot of registers to be useful?

Harold commented:

16 is not a huge number of registers, but it's enough to have 12 accumulators and 3 pieces of row vector from B and the broadcasted scalar from A, so it works out to just about enough. The loop from OP above barely uses any registers anyway. The 8 registers in 32bit are indeed too few.

Of course, since the code in the question doesn't keep "accumulators" in registers across loop iterations (it only adds into memory), compilers could have optimized all of sum[0..n] to reuse the same register in the asm; each one is "dead" after storing. So actual register pressure is very low.


Yes, x86-64 is somewhat register-poor; that's why AVX512 doubles the number as well as the width of vector regs (zmm0..31). Yes, many RISCs have 32 int / 32 fp regs, including AArch64, up from 16 in ARM.

x86-64 has 16 scalar integer registers (including the stack pointer, not including the program counter), so normal functions can use 15. There are also 16 vector regs, xmm0..15 (and with AVX they're double the width, ymm0..15).


(Some of this was written before I noticed that sum[0..n] was pointless, not loop-carried.)

Register renaming onto a large physical register file is sufficient in this case. There are other cases where having more architectural registers helps, especially with higher FP latency, hence AVX512's 32 zmm regs. But for integer code, 16 is close to enough. RISC CPUs were often designed for in-order execution without register renaming, needing software pipelining.

With OoO exec, the jump from 8 to 16 architectural GP integer regs is more significant than a jump from 16 to 32 would be, in terms of reducing spill/reloads. (I've seen a paper that measured total dynamic instruction count for SPECint with various numbers of architectural registers. I didn't look it up again, but 8->16 might have been a 10% total saving while 16->32 was only a couple percent. Something like that.)

But this specific problem doesn't need a lot of FP registers, only 4 vectors for sum[0..3] (if they were loop-carried) and maybe 1 temporary; x86 can use memory-source mul/add/FMA. Register renaming removes any WAW hazards so we can reuse the same temporary register instead of needing software pipelining. (And OoO exec also hides load and ALU latency.)

You want multiple accumulators when there are loop-carried dependencies. This code is adding into memory, not into a few vector accumulators, so any dependency is through store/reload. But that only has ~7 cycle latency so any sane cache-blocking factor hides it.

Peter Cordes