
Problem

I have been studying HPC, specifically using matrix multiplication as my project (see my other posts in my profile). I achieve good performance there, but not good enough. I am taking a step back to see how well I can do with a dot product calculation.

Dot Product vs. Matrix Multiplication

The dot product is simpler and lets me test HPC concepts without dealing with packing and other related issues. Cache blocking is still an issue, though, and it is the subject of my second question.

Algorithm

Multiply the n corresponding elements of two double arrays A and B and sum the products. A double-precision dot product in assembly is just a series of movapd, mulpd, addpd. Unrolled and arranged cleverly, the groups of movapd/mulpd/addpd can be made to operate on different xmm registers and thus be independent, which helps pipelining. It turns out this does not matter so much, since my CPU has out-of-order execution. Also note that the re-arrangement requires peeling off the last iteration.

Other Assumptions

I am not writing code for general dot products. The code is for specific sizes and I am not handling fringe cases. This is just to test HPC concepts and to see what level of CPU utilization I can attain.

Results

Compiled with gcc -std=c99 -O2 -m32 -mincoming-stack-boundary=2 -msse3 -mfpmath=sse,387 -masm=intel. I am on a different computer than usual. This computer has an i5-540M, which can reach 2.8 GHz * 4 FLOPS/cycle/core = 11.2 GFLOPS/s per core after a two-step Intel Turbo Boost (both cores are on right now, so it only gets a two-step boost; a four-step boost is possible if I turn off one core). 32-bit LINPACK gets around 9.5 GFLOPS/s when set to run with one thread.

       N   Total Gflops/s         Residual
     256         5.580521    1.421085e-014
     384         5.734344   -2.842171e-014
     512         5.791168    0.000000e+000
     640         5.821629    0.000000e+000
     768         5.814255    2.842171e-014
     896         5.807132    0.000000e+000
    1024         5.817208   -1.421085e-013
    1152         5.805388    0.000000e+000
    1280         5.830746   -5.684342e-014
    1408         5.881937   -5.684342e-014
    1536         5.872159   -1.705303e-013
    1664         5.881536    5.684342e-014
    1792         5.906261   -2.842171e-013
    1920         5.477966    2.273737e-013
    2048         5.620931    0.000000e+000
    2176         3.998713    1.136868e-013
    2304         3.370095   -3.410605e-013
    2432         3.371386   -3.410605e-013

Question 1

How can I do better than this? I am not even coming close to the peak performance. I have optimized the assembly code to high heaven. Further unrolling might boost it just a little more, but less unrolling seems to degrade performance.

Question 2

When n > 2048, you can see a drop in performance. This is because my L1 data cache is 32 KB, and when n = 2048 the two double arrays A and B fill it exactly (2 * 2048 * 8 bytes = 32 KB). Any bigger and they no longer fit in L1 and have to be streamed in from further down the memory hierarchy.

I tried cache blocking (not shown in source), but maybe I did it wrong. Can anyone provide some code or explain how to block a dot product for a cache?

Source Code

    #include <stdio.h>
    #include <time.h>
    #include <stdlib.h>
    #include <string.h>
    #include <x86intrin.h>
    #include <math.h>
    #include <omp.h>
    #include <stdint.h>
    #include <windows.h>

    // processes 8 elements (64 bytes) of A and B per invocation: 4 mulpd + 4 addpd, software-pipelined
#define KERNEL(address) \
            "movapd     xmm4, XMMWORD PTR [eax+"#address"]      \n\t" \
            "mulpd      xmm7, XMMWORD PTR [edx+48+"#address"]   \n\t" \
            "addpd      xmm2, xmm6                              \n\t" \
            "movapd     xmm5, XMMWORD PTR [eax+16+"#address"]   \n\t" \
            "mulpd      xmm4, XMMWORD PTR [edx+"#address"]      \n\t" \
            "addpd      xmm3, xmm7                              \n\t" \
            "movapd     xmm6, XMMWORD PTR [eax+96+"#address"]   \n\t" \
            "mulpd      xmm5, XMMWORD PTR [edx+16+"#address"]   \n\t" \
            "addpd      xmm0, xmm4                              \n\t" \
            "movapd     xmm7, XMMWORD PTR [eax+112+"#address"]  \n\t" \
            "mulpd      xmm6, XMMWORD PTR [edx+96+"#address"]   \n\t" \
            "addpd      xmm1, xmm5                              \n\t"

#define PEELED(address) \
            "movapd     xmm4, XMMWORD PTR [eax+"#address"]      \n\t" \
            "mulpd      xmm7, [edx+48+"#address"]               \n\t" \
            "addpd      xmm2, xmm6                  \n\t" \
            "movapd     xmm5, XMMWORD PTR [eax+16+"#address"]   \n\t" \
            "mulpd      xmm4, XMMWORD PTR [edx+"#address"]      \n\t" \
            "addpd      xmm3, xmm7                  \n\t" \
            "mulpd      xmm5, XMMWORD PTR [edx+16+"#address"]   \n\t" \
            "addpd      xmm0, xmm4                  \n\t" \
            "addpd      xmm1, xmm5                  \n\t"

inline double 
__attribute__ ((gnu_inline))        
__attribute__ ((aligned(64))) ddot_ref(
    int n,
    const double* restrict A,
    const double* restrict B)
{
    double sum0 = 0.0;
    double sum1 = 0.0;
    double sum2 = 0.0;
    double sum3 = 0.0;
    double sum;
    for(int i = 0; i < n; i+=4) {
        sum0 += *(A + i  ) * *(B + i  );
        sum1 += *(A + i+1) * *(B + i+1);
        sum2 += *(A + i+2) * *(B + i+2);
        sum3 += *(A + i+3) * *(B + i+3);
    }
    sum = sum0+sum1+sum2+sum3;
    return(sum);
}

inline double 
__attribute__ ((gnu_inline))        
__attribute__ ((aligned(64))) ddot_asm
(   int n,
    const double* restrict A,
    const double* restrict B)
{

        double sum;

            __asm__ __volatile__
        (
            "mov        eax, %[A]                   \n\t"
            "mov        edx, %[B]                   \n\t"
            "mov        ecx, %[n]                   \n\t"
            "pxor       xmm0, xmm0                  \n\t"
            "pxor       xmm1, xmm1                  \n\t"
            "pxor       xmm2, xmm2                  \n\t"
            "pxor       xmm3, xmm3                  \n\t"
            "movapd     xmm6, XMMWORD PTR [eax+32]  \n\t"
            "movapd     xmm7, XMMWORD PTR [eax+48]  \n\t"
            "mulpd      xmm6, XMMWORD PTR [edx+32]  \n\t"
            "sar        ecx, 7                      \n\t"
            "sub        ecx, 1                      \n\t" // peel
            "L%=:                                   \n\t"
            KERNEL(64   *   0)
            KERNEL(64   *   1)
            KERNEL(64   *   2)
            KERNEL(64   *   3)
            KERNEL(64   *   4)
            KERNEL(64   *   5)
            KERNEL(64   *   6)
            KERNEL(64   *   7)
            KERNEL(64   *   8)
            KERNEL(64   *   9)
            KERNEL(64   *   10)
            KERNEL(64   *   11)
            KERNEL(64   *   12)
            KERNEL(64   *   13)
            KERNEL(64   *   14)
            KERNEL(64   *   15)
            "lea        eax, [eax+1024]             \n\t"
            "lea        edx, [edx+1024]             \n\t"
            "                                       \n\t"
            "dec        ecx                         \n\t"
            "jnz        L%=                         \n\t" // end loop
            "                                       \n\t"
            KERNEL(64   *   0)
            KERNEL(64   *   1)
            KERNEL(64   *   2)
            KERNEL(64   *   3)
            KERNEL(64   *   4)
            KERNEL(64   *   5)
            KERNEL(64   *   6)
            KERNEL(64   *   7)
            KERNEL(64   *   8)
            KERNEL(64   *   9)
            KERNEL(64   *   10)
            KERNEL(64   *   11)
            KERNEL(64   *   12)
            KERNEL(64   *   13)
            KERNEL(64   *   14)
            PEELED(64   *   15)
            "                                       \n\t"
            "addpd      xmm0, xmm1                  \n\t" // summing result
            "addpd      xmm2, xmm3                  \n\t"
            "addpd      xmm0, xmm2                  \n\t" // cascading add
            "movapd     xmm1, xmm0                  \n\t" // copy xmm0
            "shufpd     xmm1, xmm0, 0x03            \n\t" // shuffle
            "addsd      xmm0, xmm1                  \n\t" // add low qword
            "movsd      %[sum], xmm0                \n\t" // mov low qw to sum
            : // outputs
            [sum]   "=m"    (sum)
            : // inputs
            [A] "m" (A),
            [B] "m" (B), 
            [n] "m" (n)
            : //register clobber
            "memory",
            "eax","ecx","edx","edi",
            "xmm0","xmm1","xmm2","xmm3","xmm4","xmm5","xmm6","xmm7"
            );
        return(sum);
}

int main()
{
    // timers
    LARGE_INTEGER frequency, time1, time2;
    double time3;
    QueryPerformanceFrequency(&frequency);
    // clock_t time1, time2;
    double gflops;

    int nmax = 4096;
    int trials = 1e7;
    double sum, residual;
    FILE *f = fopen("soddot.txt","w+");

    printf("%16s %16s %16s\n","N","Total Gflops/s","Residual");
    fprintf(f,"%16s %16s %16s\n","N","Total Gflops/s","Residual");

    for(int n = 256; n <= nmax; n += 128 ) {
        double* A = NULL;
        double* B = NULL;
        A = _mm_malloc(n*sizeof(*A), 64); if (!A) {printf("A failed\n"); return(1);}
        B = _mm_malloc(n*sizeof(*B), 64); if (!B) {printf("B failed\n"); return(1);}

        srand(time(NULL));

        // create arrays
        for(int i = 0; i < n; ++i) {
            *(A + i) = (double) rand()/RAND_MAX;
            *(B + i) = (double) rand()/RAND_MAX;
        }

        // warmup
        sum = ddot_asm(n,A,B);

        QueryPerformanceCounter(&time1);
        // time1 = clock();
        for (int count = 0; count < trials; count++){
            // sum = ddot_ref(n,A,B);
            sum = ddot_asm(n,A,B);
        }
        QueryPerformanceCounter(&time2);
        time3 = (double)(time2.QuadPart - time1.QuadPart) / frequency.QuadPart;
        // time3 = (double) (clock() - time1)/CLOCKS_PER_SEC;
        gflops = (double) (2.0*n*trials)/time3/1.0e9;
        residual = ddot_ref(n,A,B) - sum;
        printf("%16d %16f %16e\n",n,gflops,residual);
        fprintf(f,"%16d %16f %16e\n",n,gflops,residual);

        _mm_free(A);
        _mm_free(B);
    }
    fclose(f);
    return(0); // successful completion
}

EDIT: explanation of assembly

A dot product is just a repeated sum of products of two numbers: sum += a[i]*b[i]. sum must be initialized to 0 before the first iteration. Vectorized, you can do 2 sums at a time, which must be summed at the end: [sum0 sum1] = [a[i] a[i+1]]*[b[i] b[i+1]], sum = sum0 + sum1. In (Intel) assembly, this is 3 steps (after the initialization):

pxor   xmm0, xmm0              // accumulator [sum0 sum1] = [0 0]
movapd xmm1, XMMWORD PTR [eax] // load [a[i] a[i+1]] into xmm1
mulpd  xmm1, XMMWORD PTR [edx] // xmm1 = xmm1 * [b[i] b[i+1]]
addpd  xmm0, xmm1              // xmm0 = xmm0 + xmm1

At this point you have nothing special; the compiler can come up with this. You can usually get better performance by unrolling the code enough times to use all the xmm registers available to you (8 registers in 32-bit mode). So if you unroll it 4 times, that allows you to utilize all 8 registers xmm0 through xmm7: 4 accumulators and 4 registers for holding the results of movapd and mulpd. Again, the compiler can come up with this.

The real thinking part is trying to come up with a way to pipeline the code, i.e., make each instruction in the group of MOV/MUL/ADD operate on different registers so that all 3 instructions execute at the same time (usually the case on most CPUs). That's how you beat the compiler. So you have to pattern the 4x unrolled code to do just that, which may require loading vectors ahead of time and peeling off the first or last iteration. This is what KERNEL(address) is.

I made a macro of the 4x unrolled, pipelined code for convenience. That way I can easily unroll it in multiples of 4 by just changing address. Each KERNEL processes 8 elements of A and B.
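
For reference, here is a rough sketch of the same 4-accumulator idea written with SSE intrinsics rather than inline assembly. This is not the tuned kernel above, just an illustration of the register layout; the function name is made up and it assumes n is a multiple of 8 with 16-byte-aligned pointers:

#include <emmintrin.h>

// Sketch only: 4 independent accumulators so the add chains do not depend
// on each other. Assumes n % 8 == 0 and 16-byte-aligned A and B.
double ddot_intrin_sketch(int n, const double* A, const double* B)
{
    __m128d acc0 = _mm_setzero_pd(), acc1 = _mm_setzero_pd();
    __m128d acc2 = _mm_setzero_pd(), acc3 = _mm_setzero_pd();
    for (int i = 0; i < n; i += 8) {
        acc0 = _mm_add_pd(acc0, _mm_mul_pd(_mm_load_pd(A+i  ), _mm_load_pd(B+i  )));
        acc1 = _mm_add_pd(acc1, _mm_mul_pd(_mm_load_pd(A+i+2), _mm_load_pd(B+i+2)));
        acc2 = _mm_add_pd(acc2, _mm_mul_pd(_mm_load_pd(A+i+4), _mm_load_pd(B+i+4)));
        acc3 = _mm_add_pd(acc3, _mm_mul_pd(_mm_load_pd(A+i+6), _mm_load_pd(B+i+6)));
    }
    acc0 = _mm_add_pd(_mm_add_pd(acc0, acc1), _mm_add_pd(acc2, acc3));
    double lanes[2];
    _mm_storeu_pd(lanes, acc0);   // spill the two lanes and sum them
    return lanes[0] + lanes[1];
}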

matmul
  • You may want to use [compiler intrinsics](https://www.google.com/search?q=compiler+intrinsics+simd) instead of inline assembly code. It looks nicer. – tangrs Jun 08 '14 at 01:33
  • @tangrs, they don't optimize the way a human does by hand, regardless of flags. And they are slower. I have already tried them. – matmul Jun 08 '14 at 01:38
  • Hmmm, that's interesting. I always thought the intrinsics mapped 1-1 with the assembly below it. – tangrs Jun 08 '14 at 01:40
  • @tangrs I thought so too. They will usually generate the correct grouping of MOVPD/MULPD/ADDPD but I have not ever seen them do the type of re-ordering to make each MOV/MUL/ADD work on different registers. Ironically, compiler intrinsics produced a fast kernel for my matrix multiplication project, which worked faster than some other kernels I copied from Goto. Still, there was room for improvement by hand in the intrinsics case. – matmul Jun 08 '14 at 01:45
  • Can anyone comment on how many cycles each set of movpd/mulpd/addpd should take? That will give us an upper bound on the flops/cycle that can be obtained with the dot product. edit: Well it should still be 4 flops/cycle because of the way the assembly is written for pipelining...each mov/mul/add group should execute in 1 cycle and there are 4 flops. operations on a specific register are far enough apart such that the result should be ready by the time it is needed so there shouldn't be any stall. – matmul Jun 08 '14 at 02:07
  • Also wanted to add that unrolling it 32 times does make it faster by about 0.1 gflops/s. Seems like on this processor the jmp instructions might not be so good. I am tempted to completely unroll it for an array of 2048 elements. I'm sure at one point it will slow down, but Merom doesn't have an L1 instruction cache. – matmul Jun 08 '14 at 02:23
  • Why `-O2` instead of `-O3`? Why not `-march=native`? – Adam Jun 08 '14 at 02:37
  • Could you comment your assembly? – Adam Jun 08 '14 at 03:03
  • @Adam, `-O3` is just `-O2` with a few more experimental flags such as `-funroll-all-loops` which in many cases slow the code down. `-march=XXX` generates assembly which is supposed to work well for CPU type XXX. When most of the code you write is hand-optimized assembly, you are taking the compiler out of the equation. The compiler takes a high-level language and makes assembly out of it, with an attempt to "optimize." I claim that it's almost always possible to beat the compiler. I will make a small edit to explain the assembly. – matmul Jun 08 '14 at 18:59
  • @Adam, I think I have the answer, MOVAPD from memory to register takes at least 2 cycles for my CPU according to Agner Fog's manuals. In that case, I would only get 2 flops/cycle and a peak theoretical speed of 3.2 Gflops/s. I hope I am right about this, would be nice if someone could confirm. – matmul Jun 08 '14 at 20:07
  • Also adding that `-O0` (no compiler optimizations) gives the same performance as `-O3` – matmul Jun 08 '14 at 20:35
  • **ATTENTION** @Adam and anyone else following. I had a ton of bugs in my code, they are now fixed. – matmul Jun 09 '14 at 19:12
  • Please use a better definition of CPU performance. You appear to mean peak flop/s or peak op/s. Achieving this is impossible with dot product because you cannot load from memory fast enough to feed the FPU. On the other hand, if you define CPU performance to mean peak memory bandwidth, then you should be able to max out that pretty easily. See McCalpin's STREAM benchmark for analysis and best practices. – Jeff Hammond Jan 04 '15 at 05:35

1 Answer


To answer your overall question: you can't achieve peak performance with the dot product.

The problem is that your CPU can do one 128-bit load per clock cycle, while the dot product needs two 128-bit loads per clock cycle.
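
To put rough numbers on that: each movapd/mulpd/addpd group does 4 flops on 2 doubles from each array, so it needs two 16-byte loads and therefore at least 2 cycles with a single load port. That caps you at about 2 flops/cycle, half of the 4 flops/cycle SSE peak, which matches the roughly 5.8 GFLOPS/s out of 11.2 GFLOPS/s you measure while the arrays still fit in L1.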

But it's worse than that for large n. The answer to your second question is that the dot product is memory bound, not compute bound, so it does not benefit from parallelization for large n on fast cores. This is explained better here: why-vectorizing-the-loop-does-not-have-performance-improvement. This is a big problem with parallelization on fast cores. It took me a while to figure this out, but it's very important to learn.
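
A rough back-of-the-envelope for the streaming case: every 4 flops (one mulpd plus one addpd) consumes 32 bytes of fresh data, i.e. 8 bytes per flop, so sustaining your 11.2 GFLOPS/s peak would need on the order of 90 GB/s of bandwidth, while a single core typically sees something like 10-20 GB/s from DRAM. Once the vectors fall out of cache, the memory system, not the FPU, sets the limit.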

There are actually only a few basic algorithms that can fully benefit from parallelization on fast cores. In terms of BLAS, it's only the Level-3 algorithms (O(n^3)), such as matrix multiplication, that really benefit from parallelization. The situation is better on slow cores, e.g. GPUs and the Xeon Phi, because the discrepancy between memory speed and core speed is much smaller.

If you want to find an algorithm that can get close to peak flops for small n, try e.g. scalar * vector or the sum of scalar * vector. The first case should do one load, one mult, and one store every clock cycle, and the second case one mult, one add, and one load every clock cycle.

I tested the following code on a Core 2 Duo P9600@2.67GHz in Knoppix 7.3 32-bit. I get about 75% of the peak for the scalar product and 75% of the peak for the sum of the scalar product. The flops/cycle for the scalar product is 2 and for the sum of the scalar product it's 4.
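
For reference, the peak values used in the code below follow directly from the clock: the scalar product only multiplies, so its peak is 2 flops/cycle * 2.67 GHz = 5.34 GFLOPS/s (the peak = 2*2.67 line), while the sum of the scalar product does a multiply and an add per element, so its peak is 4 flops/cycle * 2.67 GHz = 10.68 GFLOPS/s (the peak = 2*2*2.67 line).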

Compiled with g++ -msse2 -O3 -fopenmp foo.cpp -ffast-math

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <x86intrin.h>

void scalar_product(double * __restrict a, int n) {
    a = (double*)__builtin_assume_aligned (a, 64);
    double k = 3.14159;
    for(int i=0; i<n; i++) {
        a[i] = k*a[i]; 
    }
}

void scalar_product_SSE(double * __restrict a, int n) {
    a = (double*)__builtin_assume_aligned (a, 64);
    __m128d k = _mm_set1_pd(3.14159);    
    for(int i=0; i<n; i+=8) {
        __m128d t1 = _mm_load_pd(&a[i+0]);
        _mm_store_pd(&a[i],_mm_mul_pd(k,t1));
        __m128d t2 = _mm_load_pd(&a[i+2]);
        _mm_store_pd(&a[i+2],_mm_mul_pd(k,t2));
        __m128d t3 = _mm_load_pd(&a[i+4]);
        _mm_store_pd(&a[i+4],_mm_mul_pd(k,t3));
        __m128d t4 = _mm_load_pd(&a[i+6]);
        _mm_store_pd(&a[i+6],_mm_mul_pd(k,t4));
    }
}

double scalar_sum(double * __restrict a, int n) {
    a = (double*)__builtin_assume_aligned (a, 64);
    double sum = 0.0;    
    double k = 3.14159;
    for(int i=0; i<n; i++) {
        sum += k*a[i]; 
    }
    return sum;
}

double scalar_sum_SSE(double * __restrict a, int n) {
    a = (double*)__builtin_assume_aligned (a, 64);
    __m128d sum1 = _mm_setzero_pd();
    __m128d sum2 = _mm_setzero_pd();
    __m128d sum3 = _mm_setzero_pd();
    __m128d sum4 = _mm_setzero_pd();
    __m128d k = _mm_set1_pd(3.14159);   
    for(int i=0; i<n; i+=8) {
        __m128d t1 = _mm_load_pd(&a[i+0]);
        sum1 = _mm_add_pd(_mm_mul_pd(k,t1),sum1);
        __m128d t2 = _mm_load_pd(&a[i+2]);
        sum2 = _mm_add_pd(_mm_mul_pd(k,t2),sum2);
        __m128d t3 = _mm_load_pd(&a[i+4]);
        sum3 = _mm_add_pd(_mm_mul_pd(k,t3),sum3);
        __m128d t4 = _mm_load_pd(&a[i+6]);
        sum4 = _mm_add_pd(_mm_mul_pd(k,t4),sum4);      
    }
    double tmp[8];
    _mm_storeu_pd(&tmp[0],sum1);
    _mm_storeu_pd(&tmp[2],sum2);
    _mm_storeu_pd(&tmp[4],sum3);
    _mm_storeu_pd(&tmp[6],sum4);
    double sum = 0;
    for(int i=0; i<8; i++) sum+=tmp[i];
    return sum;
}

int main() {
    //_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    //_mm_setcsr(_mm_getcsr() | 0x8040);
    double dtime, peak, flops, sum;
    int repeat = 1<<18;
    const int n = 2048;
    double *a = (double*)_mm_malloc(sizeof(double)*n,64);
    double *b = (double*)_mm_malloc(sizeof(double)*n,64);
    for(int i=0; i<n; i++) a[i] = 1.0*rand()/RAND_MAX;

    dtime = omp_get_wtime();
    for(int r=0; r<repeat; r++) {
        scalar_product_SSE(a,n);
    }
    dtime = omp_get_wtime() - dtime;
    peak = 2*2.67;
    flops = 1.0*n/dtime*1E-9*repeat;
    printf("time %f, %f, %f\n", dtime,flops, flops/peak);

    //for(int i=0; i<n; i++) a[i] = 1.0*rand()/RAND_MAX;
    //sum = 0.0;    
    dtime = omp_get_wtime();
    for(int r=0; r<repeat; r++) {
        scalar_sum_SSE(a,n);
    }
    dtime = omp_get_wtime() - dtime;
    peak = 2*2*2.67;
    flops = 2.0*n/dtime*1E-9*repeat;
    printf("time %f, %f, %f\n", dtime,flops, flops/peak);
    //printf("sum %f\n", sum);

}
Z boson
  • I compiled this with `-msse3` for my computer and updated `peak` and I only get about 25% CPU usage. I also tried to unroll the sum function 4 times instead of 3 but for some reason it runs extremely slowly (almost to the point where I thought it was just hanging). As for the original question, I wrote in one of the comments how a `movapd` takes a minimum of 2 cycles according to Agner Fog, so each grouping of `movapd`/`mulpd`/`addpd` (which execute in parallel since they are operating on different registers) will take 2 cycles for 4 flops, so the best I could do is 50% peak (5.6 GFLOPS/s for me) – matmul Jun 10 '14 at 15:11
  • I'm sorry, my unroll3 function was no good. I redid this using intrinsics and tested it on a Core 2 Duo system which is very similar to yours. I'm getting about 50% of the peak right now. I'm not sure why...I updated my answer with the new code. Unrolling by three should be sufficient but I used four for convenience (so n only needed to be a multiple of two). – Z boson Jun 10 '14 at 17:20
  • OK thx, I was also wondering about the correctness of unrolling by 3 since the array length was a power of 2. It wasn't strictly correct but for performance measurements it was fine (we would just miss the last 3 elements). I will give this a shot soon. – matmul Jun 10 '14 at 17:25
  • @matmul, the function had other issues. I normally vectorize code myself. I should have done that from the start. – Z boson Jun 10 '14 at 17:27
  • @matmul, I edited my code again. I get 77% of the peak now with `sum += k*a[i]`. I had to initialize the array again after the scalar product. I'm not sure why. The scalar product still is only 50%, maybe Nehalem can't really do a load, store, and product each clock cycle. – Z boson Jun 10 '14 at 18:18
  • @matmul, okay, I figured out why I had to reinit the array (sorta). If I set the constant to 0.14159 instead of 3.14159 I don't need to reinit the array. Probably some floating point exception is happening (the sum is inf). Actually, I just figured out that I don't even need to store the sum. I thought this was necessary so the compiler did not optimize the code away but it's fine I guess since I'm using intrinsics. – Z boson Jun 10 '14 at 18:25
  • The Intel optimization manuals show that Nehalem can do a load and store in parallel. Then there are 3 execution units but I can't understand what each of those does. I don't see "SSE MUL" or "SSE ADD" in there. If I could better understand this part maybe I can improve my matrix multiplication kernel. – matmul Jun 10 '14 at 18:36
  • @matmul http://images.anandtech.com/reviews/cpu/intel/Haswell/Architecture/nehalemexec.png port 0 and port 1 do SSE ADD and SSE MUL, port 2 does SSE load, and port 4 does SSE store. As far as I understand it can do four micro-ops/cycle. – Z boson Jun 10 '14 at 19:02
  • @matmul, ahh...I think I understand. I checked out Agner's microarchitecture manual. Nehalem can do 3 or 4 micro-ops per clock cycle. I think four requires fused operations. That explains why my scalar product won't get the peak. I think it's only getting 3 micro-ops/cycle. – Z boson Jun 10 '14 at 19:06
  • Wouldn't micro-op fusion make your code run at 4 flops/cycle? A micro-op that normally takes 2 cycles would be split into two 1 cycle ops and executed serially or in parallel if it can. I think the issue with `scalar_product` is that it is bound by a load from memory, which takes a minimum of 2 cycles and so the best you can do at that point is 2 flops/cycle. That would explain getting only 50% CPU. But I could also be extremely wrong and this is just a coincidence. – matmul Jun 10 '14 at 19:30
  • @matmul, the 50% for scalar product is assuming only 2 flops/cycle. See the code. I'm not sure gcc is using micro-op fusion. I have not looked at the assembly. My guess is it's not. This is a good case for your assembly code. – Z boson Jun 10 '14 at 19:40
  • @matmul, actually the scalar product and the scalar product sum both only need 3 microps (load, mult, store and load, mult, add) so fusion should not be an issue but see this for an example of how it can make a difference https://stackoverflow.com/questions/21134279/difference-in-performance-between-msvc-and-gcc-for-highly-optimized-matrix-multp – Z boson Jun 10 '14 at 20:11
  • @matmul. I wrote the scalar product function using intrinsics just now but it did not make a difference. Actually, I don't understand why writing to a new array (b) is worse than writing back to the same array (a). – Z boson Jun 10 '14 at 20:12
  • Would it have something to do with cache and/or having a different address to store to? Here's a question related to the link above: how do out-of-order engines execute your g++ assembly out of order? You have groupings of mov/mul/add, but these form a dependency chain. Would they still execute together in 2 cycles (limited by the `vmovups` from memory)? I don't see how that assembly can be re-arranged to be executed out of order. – matmul Jun 10 '14 at 20:20
  • I just got the scalar product to 75% peak by unrolling 4 times. I'll update my answer. I have no idea why unrolling is necessary. There is no carried loop dependency but maybe GCC is stupid on this. – Z boson Jun 10 '14 at 20:22
  • @matmul, the out-of-order engine is a bit of a mystery to me still. But basically the only dependency is the addition, and it has a latency of three, so that's why you need to unroll by at least three times. I know that the add depends on the mult and load, but as far as the out-of-order engine goes it only has trouble with the add. – Z boson Jun 10 '14 at 20:28
  • "the only depedency is the addition" : does that apply to the dot product example above? I thought I had eliminated that dependency completely in my mov/mul/add chain. – matmul Jun 10 '14 at 20:52
  • @matmul, no, that's what I thought initially also, but as far as the out-of-order engine goes the only dependency is the addition. See this http://stackoverflow.com/questions/21090873/loop-unrolling-to-achieve-maximum-throughput-with-ivy-bridge-and-haswell – Z boson Jun 11 '14 at 06:55
  • @matmul, I just realized that the sum of the scalar product could be done just by taking the sum and then multiplying k*(a[0] + a[1] + ...). But the result should be the same. That would halve the OPs, but the peak is halved as well. It frees up the mult port, but that's not being used anyway, so it makes no difference. – Z boson Jun 12 '14 at 08:07
  • @matmul, also I said to use Knoppix for 64-bit. That turns out to be wrong. GCC in Knoppix only compiles for 32-bit even when booting 64-bit (I figured this out yesterday). That's probably why my unroll3 function (despite the multiple-of-three bug) worked well on 64-bit Ubuntu but not with Knoppix. GCC does not optimize as well for 32-bit. I'll have to find 64-bit GCC on USB now. The only way I know how to get high efficiency with matrix multiplication is with more than 8 SSE registers. – Z boson Jun 12 '14 at 08:11
  • I've been wondering how much better I could do with 64-bit matrix multiplication, but I don't think it would make a big difference. I compiled OpenBLAS for 32-bit and one thread and it gets over 95% of the theoretical max using just 8 `xmm` registers. I thought I saw Mysticial say somewhere that you could get peak performance using just 2 registers. I've gotten my kernel to get close to 95%, but my overall code seems to have too much overhead, so overall I get just over 80%. – matmul Jun 12 '14 at 15:07
  • 80%! Post your code! I would not be surprised if it can be done in eight registers or less but I don't know how to do it yet. Mysticial is probably referring to tricks with register renaming. – Z boson Jun 12 '14 at 16:14
  • I will post it either today or tomorrow with explanations, I am trying to eliminate the overhead from packing (that's what's preventing me from getting closer to 95%). But if you would like a preview, I basically worked off the concept from the OpenBLAS kernel called `gemm_kernel_2x4_PENRYN.S` at loop `.L16` and made some improvements to it. The trick appears to be (1) data reuse / eliminate loads and (2) have it so as soon as you issue the first `mul`, the remaining `muls` issue once every cycle in the loop. This way you are only limited by the `adds`. – matmul Jun 12 '14 at 16:33
  • @matmul, did you give up on SO? Did you ever make your matrix multiplication tutorial? I was really looking forward to that. – Z boson Oct 06 '14 at 19:30
  • I never got a chance to do it; I've been really busy at work with another project. I want to come back to this at some point and see if I can get those few extra percent in performance to get closer to 100% CPU. – matmul Nov 01 '14 at 18:26
  • @matmul, if you're willing to write up your tutorial I can ask a question about how to obtain peak flops and make a 500 point bounty. I know you don't care so much about the points but it's the best I can do on SO and your tutorial is worth it to me. It would save me a lot of time which I don't have currently. If you're interested, just let me know when and I'll make the question. – Z boson Jan 25 '15 at 18:09