Replicating BLAS matrix multiplication performance: Can I match it?

Question

Background

If you have been following my posts, I am attempting to replicate the results found in Kazushige Goto's seminal paper on square matrix multiplication C = AB. My last post regarding this topic can be found here. In that version of my code, I follow the memory layering and packing strategy of Goto with an inner kernel computing 2x8 blocks of C using 128 bit SSE3 intrinsics. My CPU is i5-540M with hyperthreading off. Additional info about my hardware can be found in another post and is repeated below.

My Hardware

My CPU is an Intel i5 - 540M. You can find the relevant CPUID information on cpu-world.com. The microarchitecture is Nehalem (westmere), so it can theoretically compute 4 double precision flops per core per cycle. I will be using just one core (no OpenMP), so with hyperthreading off and 4-step Intel Turbo Boost, I should be seeing a peak of ( 2.533 Ghz + 4*0.133 Ghz ) * ( 4 DP flops/core/cycle ) * ( 1 core ) = 12.27 DP Gflops. For reference, with both cores running at peak, Intel Turbo Boost gives a 2-step speed up and I should get a theoretical peak of 22.4 DP Gflops.

My Software

Windows7 64 bit, but MinGW/GCC 32 bit due to restrictions on my computer.

What's new this time?

I compute 2x4 blocks of C. This gives better performance and is in line with what Goto says (half the registers shoubld be used to compute C). I have tried many sizes: 1x8, 2x8, 2x4, 4x2, 2x2, 4x4.
My inner kernel is hand-coded x86 assembly, optimized to the best of my ability (matches some of the kernels that Goto wrote), which gives a rather large performance boost over SIMD. This code is unrolled 8 times inside the inner kernel (defined as a macro for convenience), giving the best performance out of other unrolling strategies I tried.
I use the Windows performance counters to time the codes, rather than clock().
I time the inner kernel independently of the total code, to see how well my hand-coded assembly is doing.
I report the best result from some number of trials, rather than an average over the trials.
No more OpenMP (single core perfomance only).
NOTE I will recompile OpenBLAS tonight to use only 1 core so I can compare.

Some Preliminary Results

N is the dimension of the square matrices, Total Gflops/s is the Gflops/s of the entire code, and Kernel Gflops/s is the Gflops/s of the inner kernel. You can see that with a 12.26 Gflops/s peak on one core, the inner kernel is getting around 75% efficiency, while the overall code is about 60% efficient.

I would like to get closer to 95% efficiency for the kernel and 80% for the overall code. What else can I do to improve the performance, of at least the inner kernel?

       N   Total Gflops/s  Kernel Gflops/s
     256         7.778089         9.756284
     512         7.308523         9.462700
     768         7.223283         9.253639
    1024         7.197375         9.132235
    1280         7.142538         8.974122
    1536         7.114665         8.967249
    1792         7.060789         8.861958

Source Code

If you are feeling particularly magnanimous, please test my code on your machine. Compiled with gcc -std=c99 -O2 -m32 -msse3 -mincoming-stack-boundary=2 -masm=intel somatmul2.c -o somatmul2.exe. Feel free to try other flags, but I have found these to work best on my machine.

#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <string.h>
#include <xmmintrin.h>
#include <math.h>
#include <omp.h>
#include <stdint.h>
#include <windows.h>

#ifndef min
    #define min( a, b ) ( ((a) < (b)) ? (a) : (b) )
#endif

inline void 
__attribute__ ((gnu_inline))        
__attribute__ ((aligned(64))) rpack(        double* restrict dst, 
          const double* restrict src, 
            const int kc, const int mc, const int mr, const int n)
{
    double tmp[mc*kc] __attribute__ ((aligned(64)));
    double* restrict ptr = &tmp[0];

    for (int i = 0; i < mc; ++i)
        for (int j = 0; j < kc; j+=4) {
            *ptr++ = *(src + i*n + j  );
            *ptr++ = *(src + i*n + j+1);
            *ptr++ = *(src + i*n + j+2);
            *ptr++ = *(src + i*n + j+3);
        }
    ptr = &tmp[0];

    //const int inc_dst = mr*kc;
    for (int k = 0; k < mc; k+=mr)
        for (int j = 0; j < kc; ++j)
            for (int i = 0; i < mr*kc; i+=kc)
                *dst++ = *(ptr + k*kc + j + i);

}

inline void 
__attribute__ ((gnu_inline))        
__attribute__ ((aligned(64)))  cpack(double* restrict dst, 
                const double* restrict src, 
                const int nc, 
                const int kc, 
                const int nr, 
                const int n)
{ //nc cols, kc rows, nr size ofregister strips
    double tmp[kc*nc] __attribute__ ((aligned(64)));
    double* restrict ptr = &tmp[0];

    for (int i = 0; i < kc; ++i)
        for (int j = 0; j < nc; j+=4) {
            *ptr++ = *(src + i*n + j  );
            *ptr++ = *(src + i*n + j+1);
            *ptr++ = *(src + i*n + j+2);
            *ptr++ = *(src + i*n + j+3);
        }

    ptr = &tmp[0];

    // const int inc_k = nc/nr;
    for (int k = 0; k < nc; k+=nr)
        for (int j = 0; j < kc*nc; j+=nc)
            for (int i = 0; i < nr; ++i)
                *dst++ = *(ptr + k + i + j);

}

#define KERNEL0(add0,add1,add2,add3)                                \
            "mulpd      xmm4, xmm6                      \n\t" \
            "addpd      xmm0, xmm4                      \n\t" \
            "movapd     xmm4, XMMWORD PTR [edx+"#add2"] \n\t" \
            "mulpd      xmm7, xmm4                      \n\t" \
            "addpd      xmm1, xmm7                      \n\t" \
            "movddup        xmm5, QWORD PTR [eax+"#add0"]   \n\t" \
            "mulpd      xmm6, xmm5                      \n\t" \
            "addpd      xmm2, xmm6                      \n\t" \
            "movddup        xmm7, QWORD PTR [eax+"#add1"]   \n\t" \
            "mulpd      xmm4, xmm5                      \n\t" \
            "movapd     xmm6, XMMWORD PTR [edx+"#add3"] \n\t" \
            "addpd      xmm3, xmm4                      \n\t" \
            "movapd     xmm4, xmm7                      \n\t" \
            "                                           \n\t"

inline void 
__attribute__ ((gnu_inline))        
__attribute__ ((aligned(64))) dgemm_2x4_asm_j
(   
        const int mc, const int nc, const int kc,
        const double* restrict locA, const int cs_a, // mr
        const double* restrict locB, const int rs_b, // nr
              double* restrict C, const int rs_c
    )
{

    const double* restrict a1 = locA;
    for (int i = 0; i < mc ; i+=cs_a) {
        const double* restrict b1 = locB;
        double* restrict c11 = C + i*rs_c;
        for (int j = 0; j < nc ; j+=rs_b) {

            __asm__ __volatile__
        (
            "mov        eax, %[a1]                  \n\t"
            "mov        edx, %[b1]                  \n\t"
            "mov        edi, %[c11]                 \n\t"
            "mov        ecx, %[kc]                  \n\t"
            "pxor       xmm0, xmm0                  \n\t"
            "movddup    xmm7, QWORD PTR [eax]       \n\t" // a1
            "pxor       xmm1, xmm1                  \n\t"
            "movapd     xmm6, XMMWORD PTR [edx]     \n\t" // b1
            "pxor       xmm2, xmm2                  \n\t"
            "movapd     xmm4, xmm7                  \n\t" // a1
            "pxor       xmm3, xmm3                  \n\t"
            "sar        ecx, 3                      \n\t" // divide by 2^num
            "L%=:                                   \n\t" // start loop
            KERNEL0(    8,      16,     16,     32)
            KERNEL0(    24,     32,     48,     64)
            KERNEL0(    40,     48,     80,     96)
            KERNEL0(    56,     64,     112,    128)
            KERNEL0(    72,     80,     144,    160)
            KERNEL0(    88,     96,     176,    192)
            KERNEL0(    104,    112,    208,    224)
            KERNEL0(    120,    128,    240,    256)
            "add        eax, 128                    \n\t"
            "add        edx, 256                    \n\t"
            "                                       \n\t"
            "dec        ecx                         \n\t"
            "jne        L%=                         \n\t" // end loop
            "                                       \n\t"
            "mov        esi, %[rs_c]                \n\t" // don't need cs_a anymore
            "sal        esi, 3                      \n\t" // times 8
            "lea        ebx, [edi+esi]              \n\t" // don't need b1 anymore
            "addpd      xmm0, XMMWORD PTR [edi]     \n\t" // c11
            "addpd      xmm1, XMMWORD PTR [edi+16]  \n\t" // c11 + 2
            "addpd      xmm2, XMMWORD PTR [ebx]     \n\t" // c11
            "addpd      xmm3, XMMWORD PTR [ebx+16]  \n\t" // c11 + 2
            "movapd     XMMWORD PTR [edi], xmm0     \n\t"
            "movapd     XMMWORD PTR [edi+16], xmm1  \n\t"
            "movapd     XMMWORD PTR [ebx], xmm2     \n\t"
            "movapd     XMMWORD PTR [ebx+16], xmm3  \n\t"
            : // no outputs 
            : // inputs
            [kc]    "m" (kc),
            [a1]    "m" (a1), 
            [cs_a]  "m" (cs_a),
            [b1]    "m" (b1),
            [rs_b]  "m" (rs_b),
            [c11]   "m" (c11),
            [rs_c]  "m" (rs_c)
            : //register clobber
            "memory",
            "eax","ebx","ecx","edx","esi","edi",
            "xmm0","xmm1","xmm2","xmm3","xmm4","xmm5","xmm6","xmm7"
            );
        b1 += rs_b*kc;
        c11 += rs_b;
        }
    a1 += cs_a*kc;
    }
}


double blis_dgemm_ref(
        const int n,
        const double* restrict A,
        const double* restrict B,
        double* restrict C,
        const int mc,
        const int nc,
        const int kc
    )
{
    int mr = 2;
    int nr = 4;
    double locA[mc*kc] __attribute__ ((aligned(64)));
    double locB[kc*nc] __attribute__ ((aligned(64)));

    LARGE_INTEGER frequency, time1, time2;
    double time3 = 0.0;
    QueryPerformanceFrequency(&frequency);

    // zero C
    memset(C, 0, n*n*sizeof(double));

    int ii,jj,kk;
    //#pragma omp parallel num_threads(2) shared(A,B,C) private(ii,jj,kk,locA,locB)
    {//use all threads in parallel
        //#pragma omp for
        for ( jj = 0; jj < n; jj+=nc) { 
            for ( kk = 0; kk < n; kk+=kc) {
                cpack(locB, B + kk*n + jj, nc, kc, nr, n);
                for ( ii = 0; ii < n; ii+=mc) {
                    rpack(locA, A + ii*n + kk, kc, mc, mr, n);
                            QueryPerformanceCounter(&time1);
                            dgemm_2x4_asm_j( mc, nc, kc,
                                       locA         ,  mr,
                                       locB         ,  nr,
                                       C + ii*n + jj,  n );
                            QueryPerformanceCounter(&time2);
                            time3 += (double) (time2.QuadPart - time1.QuadPart);
                }
            }
        }
    }
    return time3 / frequency.QuadPart;
}

double compute_gflops(const double time, const int n)
{
    // computes the gigaflops for a square matrix-matrix multiplication
    double gflops;
    gflops = (double) (2.0*n*n*n)/time/1.0e9;
    return(gflops);
}

void main() {
    LARGE_INTEGER frequency, time1, time2;
    double time3, best_time;
    double kernel_time, best_kernel_time;
    QueryPerformanceFrequency(&frequency);
    int best_flag;
    double gflops, kernel_gflops;

    const int trials = 100;
    int nmax = 4096;
    printf("%16s %16s %16s\n","N","Total Gflops/s","Kernel Gflops/s");

    int mc = 256;
    int kc = 256;
    int nc = 128;
    for (int n = kc; n <= nmax; n+=kc) {
        double *A = NULL;   
        double *B = NULL;
        double *C = NULL;

        A = _mm_malloc (n*n * sizeof(*A),64); if (!A) {printf("A failed\n"); return;}
        B = _mm_malloc (n*n * sizeof(*B),64); if (!B) {printf("B failed\n"); return;}
        C = _mm_malloc (n*n * sizeof(*C),64); if (!C) {printf("C failed\n"); return;}

        srand(time(NULL));

        // Create the matrices
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                *(A + i*n + j) = (double) rand()/RAND_MAX;
                *(B + i*n + j) = (double) rand()/RAND_MAX;
            }
        }

            // warmup
            blis_dgemm_ref(n,A,B,C,mc,nc,kc);

            for (int count = 0; count < trials; count++){
                    QueryPerformanceCounter(&time1);
                    kernel_time = blis_dgemm_ref(n,A,B,C,mc,nc,kc);
                    QueryPerformanceCounter(&time2);
                    time3 = (double)(time2.QuadPart - time1.QuadPart) / frequency.QuadPart;
                if (count == 0) {
                    best_time = time3;
                    best_kernel_time = kernel_time;
                }
                else {
                    best_flag = ( time3 < best_time ? 1 : 0 );
                    if (best_flag) {
                        best_time = time3;
                        best_kernel_time = kernel_time;
                    }
                }
            }
            gflops = compute_gflops(best_time, n);
            kernel_gflops = compute_gflops(best_kernel_time, n);
            printf("%16d %16f %16f\n",n,gflops,kernel_gflops);

        _mm_free(A);
        _mm_free(B);
        _mm_free(C);
    }
    printf("tests are done\n");

}

EDIT

Replace the packing functions with the following new and faster versions:

inline void 
__attribute__ ((gnu_inline))        
__attribute__ ((aligned(64))) rpack(        double* restrict dst, 
          const double* restrict src, 
            const int kc, const int mc, const int mr, const int n)
{
    for (int i = 0; i < mc/mr; ++i)
        for (int j = 0; j < kc; ++j)
            for (int k = 0; k < mr; ++k)
                *dst++ = *(src + i*n*mr + k*n + j);
}

inline void 
__attribute__ ((gnu_inline))        
__attribute__ ((aligned(64)))  cpack(double* restrict dst, 
                const double* restrict src, 
                const int nc, 
                const int kc, 
                const int nr, 
                const int n)
{   
    for (int i = 0; i < nc/nr; ++i)
        for (int j = 0; j < kc; ++j)
            for (int k = 0; k < nr; ++k)
                *dst++ = *(src + i*nr + j*n + k);
}

Results With New Packing Functions

Nice boost to the overall performance:

       N   Total Gflops/s  Kernel Gflops/s
     256         7.915617         8.794849
     512         8.466467         9.350920
     768         8.354890         9.135575
    1024         8.168944         8.884611
    1280         8.174249         8.825920
    1536         8.285458         8.938712
    1792         7.988038         8.581001

LINPACK 32-bit with 1 thread

CPU frequency:    2.792 GHz
Number of CPUs: 1
Number of cores: 2
Number of threads: 1

Performance Summary (GFlops)
Size   LDA    Align.  Average  Maximal
128    128    4       4.7488   5.0094  
256    256    4       6.0747   6.9652  
384    384    4       6.5208   7.2767  
512    512    4       6.8329   7.5706  
640    640    4       7.4278   7.8835  
768    768    4       7.7622   8.0677  
896    896    4       7.8860   8.4737  
1024   1024   4       7.7292   8.1076  
1152   1152   4       8.0411   8.4738  
1280   1280   4       8.1429   8.4863  
1408   1408   4       8.2284   8.7073  
1536   1536   4       8.3753   8.6437  
1664   1664   4       8.6993   8.9108  
1792   1792   4       8.7576   8.9176  
1920   1920   4       8.7945   9.0678  
2048   2048   4       8.5490   8.8827  
2176   2176   4       9.0138   9.1161  
2304   2304   4       8.1402   9.1446  
2432   2432   4       9.0003   9.2082  
2560   2560   4       8.8560   9.1197  
2688   2688   4       9.1008   9.3144  
2816   2816   4       9.0876   9.3089  
2944   2944   4       9.0771   9.4191  
3072   3072   4       8.9402   9.2920  
3200   3200   4       9.2259   9.3699  
3328   3328   4       9.1224   9.3821  
3456   3456   4       9.1354   9.4082  
3584   3584   4       9.0489   9.3351  
3712   3712   4       9.3093   9.5108  
3840   3840   4       9.3307   9.5324  
3968   3968   4       9.3895   9.5352  
4096   4096   4       9.3269   9.3872

EDIT 2

Here are some results from single-threaded OpenBLAS, which took 4 hours to compile last night. As you can see, it is getting close to 95% CPU usage. Max single-threaded performance with both cores on is 11.2 Gflops (2 step Intel Turbo Boost). I need to turn off the other core to get 12.26 Gflops (4 step Intel Turbo Boost). Assume that the packing functions in OpeBLAS generate no additional overhead. Then the OpenBLAS kernel must be running at least as fast as the total OpenBLAS code. So I need to get my kernel running at that speed. I have yet to figure out how to make my assembly faster. I will be focusing on this over the next few days.

Ran the tests below from Windows command line with: start /realtime /affinity 1 My Code:

   N   Total Gflops/s  Kernel Gflops/s
       256   7.927740   8.832366
       512   8.427591   9.347094
       768   8.547722   9.352993
      1024   8.597336   9.351794
      1280   8.588663   9.296724
      1536   8.589808   9.271710
      1792   8.634201   9.306406
      2048   8.527889   9.235653

OpenBLAS:

   N   Total Gflops/s
   256  10.599065
   512  10.622686
   768  10.745133
  1024  10.762757
  1280  10.832540
  1536  10.793132
  1792  10.848356
  2048  10.819986

@Adam, I added a section about my hardware explaining the calculation — matmul, Jun 05 '14 at 23:40
Are you sure Turbo Boost is kicking in? The CPU gets to decide on that based on things like temperature. Without it your theoretical peak would be 10.132 Gflops. — Adam, Jun 05 '14 at 23:48
I'd suggest to first write a code that hits that peak in any way. Maybe just an SIMD multiply loop. Then measure your memory bandwidth, i.e. how fast can you fill those registers. That should give you a roofline on what your memory can do. It's entirely possible that memory bandwidth limits your computation. Where "memory" == "L1 data cache". — Adam, Jun 05 '14 at 23:50
How about a dot product hand-written in assembly, that uses all the registers and performs the operations on cached blocks in parallel. I will write up the pseudocode for that tonight/tomorrow morning. I also posted the LINPACK results in the OP. — matmul, Jun 06 '14 at 00:18
At dot product is a good idea. A large N dot product would double as a memory benchmark, similar to STREAM. A loop with a small N, such as the width of one or two SIMD registers, might be required to hit peak FLOPS. — Adam, Jun 06 '14 at 01:15
BTW, look at that CPU frequency reported by Linpack. That's less than 2.533 Ghz + 4*0.133 Ghz. — Adam, Jun 06 '14 at 01:17
Yea, I've been reading on some forums about how you have to physically disable all except one core in order to get the maximum turbo boost of 2.533 + 4*0.133. As long as both cores are on, I will only get 2.533 + 2*0.133 = 2.8 GHz per core, regardless of whether I set threads = 1 or /affinity 1. — matmul, Jun 06 '14 at 03:13
You might find this handy: http://stackoverflow.com/questions/8389648/how-do-i-achieve-the-theoretical-maximum-of-4-flops-per-cycle BTW, I enjoy reading your posts about this as I'm writing sparse matrix-sparse matrix multiplication codes now. They are only tangentially related but it's good to read about your progress and I wouldn't be surprised if I find a way to apply your tricks somehow in the future. — Adam, Jun 06 '14 at 04:10
How did you change the packing functions to make your code faster? — Z boson, Jun 06 '14 at 09:34
Good that you concentrate on one core for now. OpenMP is a whole new bag of worms. — Z boson, Jun 06 '14 at 09:36
BTW. You can still use `omp_get_wtime()`. It's accurate and cross-platform. — Z boson, Jun 06 '14 at 09:39
I'm surprised you still continue recalculating `rpack` every `jj` iteration (recalculating the same thing). I guess that's what Goto did? — Z boson, Jun 06 '14 at 09:41
Adam, it's been a humbling experience. That thread was one of the first things I looked at. I get around 11 Gflops with Mystical's code, which is slow. @Zboson, look under "EDIT" where you can see the new packing functions and compare them with the old ones in the code above it. The old packing functions were temporary and just bad (they were copying the array twice). But as you can see the kernel still runs at the same speed. I will post another edit with some more info about single-threaded BLAS performance. — matmul, Jun 06 '14 at 17:19
The numbers you print out measure `blis_dgemm_ref()` as a whole, so why do you call `QueryPerformanceCounter` in the kernel too? — Adam, Jun 06 '14 at 17:41
It might be easier to work with TurboBoost simply disabled, so you're not worrying about an uncertain moving target. You'll then get some known fraction of a more easily calculated peak, and have a more straighforward time optimizing. — Phil Miller, Jun 06 '14 at 17:49
@Adam, I am measuring the total execution speed and the speed of the inner kernel as well. The two separate measurements give me an idea of how fast the actual multiplication part is running and how much overhead the outer loops and packing add. The numbers tell me that my kernel is too slow. Novelocrat, I don't have that option in my BIOS. — matmul, Jun 06 '14 at 18:06
I ask because measuring time can be quite expensive at these scales. I suspect that that and the frequency query account for a significant difference between your kernel and your total times. — Adam, Jun 06 '14 at 19:12
So you think my kernel is actually faster than what is being reported? The subject of timing executions boggles my mind. I can't seem to wrap my head around it. So many functions/performance counters/etc and so many different things they report. — matmul, Jun 06 '14 at 20:11
Hi @Adam,I will be making a new SO post in a couple of hours regarding a dot product code that I wrote to test some of the concepts we talked about. Be on the lookout! — matmul, Jun 07 '14 at 20:23
Yeah, I suspect that your gap between total and kernel times is actually narrower than reported. It would be easy to check; just comment out the kernel timing and see what happens to the total time. My lower-level timers are guarded by `#ifdef` so it's trivial to turn them on/off for exactly this reason. — Adam, Jun 08 '14 at 00:39
OK @Adam I will do that on Monday. Also dot product here: http://stackoverflow.com/questions/24102103/how-do-i-attain-peak-cpu-performance-with-dot-product — matmul, Jun 08 '14 at 01:08
@Zboson, any thoughts on this topic? http://stackoverflow.com/questions/24102103/how-do-i-attain-peak-cpu-performance-with-dot-product — matmul, Jun 09 '14 at 01:40
@matmul: I've done similar functions myself (some experimentation on nonlinear data ordering for right matrix, e.g. B11 B12 B21 B22 B31 .. for SSE3), but I use Linux, so I cannot run your code. I use rdtsc, i.e. `__builtin_ia32_rdtsc()` in GCC, to measure elapsed clock cycles; only use on CPUs with reliable TSC (I use an AMD Barcelona right now). Two suggestions: 1) Use a 64-bit integer type (`uint64_t` from `stdint.h`, for example) instead of `double` to pack the data, and 2) interleave your register use (have at least one (three?) instruction between calculation and result register use). — Nominal Animal, Jun 12 '14 at 15:29
Hi @NominalAnimal. Can you explain the bit about using `uint64_t` to pack the data? Also, ignore the assembly kernel. I have a new kernel that runs significantly faster (close to 95% peak cpu). The challenge now is to eliminate the overhead from packing, which drops my performance down to 80%. — matmul, Jun 12 '14 at 16:26
@matmul: packing just reorders the data, right? You can use an (unsigned) integer type of the same size as `double` on x86-64 (and on all other currently widely-used hardware architectures), so that normal registers, including additional registers `r8` to `r15` on x8-64, are used instead of floating point registers. Just replace `double` with `uint64_t`, and cast the pointer when calling `rpack()`/`cpack()`. Also, add logic to avoid packing when not necessary, and unroll and interleave that, too (including the addressing logic). — Nominal Animal, Jun 12 '14 at 17:39
@NominalAnimal, while I have that type (on my 64 bit system -- and it works), I only have access to 32 bit GCC (compute restrictions), so I cannot access those extra registers. I have moved the packing completely outside the loops and access these reordered versions of `A` and `B`. I unrolled the pack functions 32 times each and they read/write to `xmm` registers when copying elements. Sadly, because of the way the re-ordering is done, I can only do 1 element at a time so it only uses the lower quadword of `xmm`. — matmul, Jun 12 '14 at 18:01
@NominalAnimal, while packing outside the loops eliminates most of the overhead, the packed matrices (`_mm_malloc`'d) are now too large to fit in the L2 cache and this slows down the inner kernel by a lot. Now I am back to figuring out how to move the data I need into L2 before running the kernel. — matmul, Jun 12 '14 at 18:55
@matmul: Using xmm registers is slower than normal registers. The difference may be small (1% - 2%), but I have a nagging suspicion CPUs do better access prediction on normal registers than xmm registers. As to L2 cache: GCC has [`__builtin_prefetch(ptr)`](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html) you can sometimes use.. my data is large, so that's why I'm looking at various data orders to maximize cache locality, and interleaving data packing (gathering) with the multiplication (kind of prefetch, also much less temp mem needed overall). Non-square matrices are another wrinkle! — Nominal Animal, Jun 12 '14 at 19:14
@NominalAnimal, for the matrix multiplication problem it turns out that you allocate a block of one matrix, say `B` to the stack (L2) and the other, say `A`, to RAM with `_mm_malloc`. `A` is also prepacked outside the main loops. The loop order must also be such that you are **not** packing `B` (the L2 block) more than necessary (e.g., how I have it in my OP). This all assumes that you are multiplying `AB` in such a way that slivers of `A` are reused, forcing those elements (hopefully) into L1. In my assembly kernel you can see `a1` slivers being reused inside `while (j--)` — matmul, Jun 12 '14 at 21:49
Unfortunately, I think the fact that I am also reloading `a1` inside the `while (j--)` loop doesn't bring those elements into the cache. Looks like I may need to use actual assembly to fine tune this. When all elements are in L1/L2, the assembly kernel executes nearly at the peak of the CPU. — matmul, Jun 12 '14 at 21:51
First doing memory moves (especially up to L2 size), then doing ALU/FP-intensive operations, and then again doing memory moves, will never produce maximum throughput (minimum runtime). You need to interleave the memory accesses and the multiplications, so that you get maximum out of *both* RAM/cache bandwidth and ALU units; essentially exploiting the latencies. This includes packing, too. In particular, you *can* pack the next row/column while calculating results based on the previous row/column. It is darned hard (complex!), but it shows in run time. — Nominal Animal, Jun 12 '14 at 22:34
@matmul, I'm pretty sure I can get over 70% of the peak with SSE (including with OpenMP). But if you're already getting 80% I don't see the point in giving an answer. Why don't you post your new 80% code as an answer your your bounty and if I find the time to modify my code for SSE (it's currently using AVX or FMA) I'll post mine maybe this weekend. — Z boson, Jun 13 '14 at 12:34
I'm up to 85% over all `n`, I'm trying to work out some kinks. My kernel is capable of getting over 95% depending on memory accesses. I think the bounty is nonrefundable and I can't award it to myself anyway, so if you just post anything I can give it to you so it won't go to waste. — matmul, Jun 13 '14 at 17:35

Aaron Altman · Answer 1 · 2014-07-17T13:57:27.200

It's theoretically possible to look at that code and reason through whether it could be arranged to make better use of microarchitectural resources - but even the performance architects at Intel might not recommend doing it that way. It might help to use a tool like VTune or Intel Performance Counter Monitor to find out how much of your workload is memory versus front-end versus back-end bound. Intel Architecture Code Analyzer might also be a quick source of help narrowing down which of the potential issues listed below to follow up on first.

Nominal Animal is probably on the right track in the comments talking about interleaving instructions that access memory and those that do computation. A few other possibilities:

Using other instructions for some of the computation might reduce pressure on one of the execution ports (see section 3.3.4 of this presentation). In particular, mulpd is always going to dispatch to port 1 on Westmere. Maybe if there are any cycles where port 0 isn't getting used, you could sneak in a scalar FP multiply there.
One or another of the hardware prefetchers could be saturating the bus early or polluting the cache with lines you don't end up using.
On the other hand, there's a slim possibility that the ordering of memory references or the memory layout implied in dgemm_2x4_asm_j is faking out the prefetchers.
Changing the relative ordering of pairs of instructions that don't have any data dependencies might lead to better use of front-end or back-end resources.

Hey I've made tremendous progress since I posted this. I am able to get over 90% of the peak theoretical performance. But OpenBLAS is getting over 97%. I'm not sure what voodoo magic is going on there, but I intend to retur to this problem in the future when I have some time. — matmul, Jul 17 '14 at 19:38

Replicating BLAS matrix multiplication performance: Can I match it?

1 Answers1

Linked