
I'm doing a matrix-matrix multiplication of complex doubles (ZGEMM) and thought using SSE would help, but it doesn't; in fact, it slows the code down. I wanted to know whether that might be because the code is memory-bound.

Here's the pseudocode:

To multiply two complex doubles I use the following sequence, as proposed by Intel (assuming the real and imaginary parts are stored contiguously):

If M = (a+ib) and IN = (c+id), so that M*IN = (ac-bd) + i(ad+bc):

M1 = _mm_loaddup_pd(&M[0]);//M1->|a|a|
I1 = _mm_load_pd(&IN[0]);//I1->|d|c| (requires 16-byte alignment)
T1 = _mm_mul_pd(M1,I1);//T1->|a*d|a*c|

M1 = _mm_loaddup_pd(&M[1]);//M1->|b|b|
I1 = _mm_shuffle_pd(I1,I1,1);//I1->|c|d|
I1 = _mm_mul_pd(M1,I1);//I1->|b*c|b*d|

T1 = _mm_addsub_pd(T1,I1); //T1-> |ad+bc|ac-bd|

Thus T1 stores the result of the complex multiplication.
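For reference, here is that kernel as a minimal compilable function (the name cmul_sse, the double[2] layout, and the unaligned load are my assumptions; compile with -msse3):

#include <pmmintrin.h> /* SSE3: _mm_loaddup_pd, _mm_addsub_pd */
#include <stdio.h>

/* out = M * IN for complex doubles stored as {real, imag}.
   Uses _mm_loadu_pd to sidestep the 16-byte alignment that
   _mm_load_pd would require. */
static void cmul_sse(const double M[2], const double IN[2], double out[2])
{
    __m128d M1 = _mm_loaddup_pd(&M[0]);  /* |a|a|         */
    __m128d I1 = _mm_loadu_pd(&IN[0]);   /* |d|c|         */
    __m128d T1 = _mm_mul_pd(M1, I1);     /* |a*d|a*c|     */

    M1 = _mm_loaddup_pd(&M[1]);          /* |b|b|         */
    I1 = _mm_shuffle_pd(I1, I1, 1);      /* |c|d|         */
    I1 = _mm_mul_pd(M1, I1);             /* |b*c|b*d|     */

    T1 = _mm_addsub_pd(T1, I1);          /* |ad+bc|ac-bd| */
    _mm_storeu_pd(out, T1);
}

int main(void)
{
    double m[2] = {1.0, 2.0}, in[2] = {3.0, 4.0}, res[2];
    cmul_sse(m, in, res); /* (1+2i)*(3+4i) = -5+10i */
    printf("%g %+gi\n", res[0], res[1]);
    return 0;
}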

This is the matrix multiplication (C[i][j] += A[i][k]*B[k][j]):

/*Assumes real and imaginary elements are stored contiguously*/
/*Used loop order: ikj for better cache locality and used 2*2 block matrix mult*/
for(i=0;i<N;i+=2){
  for(k=0;k<N;k+=2){
    /*Perform the _mm_loaddup() part here for A[i][k], A[i][k+1], A[i+1][k], A[i+1][k+1], since I'm blocking for a 2*2 matrix multiply, i.e. load duplicates of 8 double values into 8 SIMD registers here*/
     A00r = _mm_loaddup_pd(&A[(i*N+k)*2+0]);
     A00i = _mm_loaddup_pd(&A[(i*N+k)*2+1]);
     A01r = _mm_loaddup_pd(&A[(i*N+k)*2+2]);
     A01i = _mm_loaddup_pd(&A[(i*N+k)*2+3]);
     A10r = _mm_loaddup_pd(&A[((i+1)*N+k)*2+0]);
     A10i = _mm_loaddup_pd(&A[((i+1)*N+k)*2+1]);
     A11r = _mm_loaddup_pd(&A[((i+1)*N+k)*2+2]);
     A11i = _mm_loaddup_pd(&A[((i+1)*N+k)*2+3]);
     for(j=0;j<N;j+=2){
       double out[8] = {0,0,0,0,0,0,0,0};
       op00=op01=op10=op11=_mm_setzero_pd();

       B00 = _mm_loadu_pd(&B[(k*N+j)*2]);
       B00_r = _mm_shuffle_pd(B00,B00,1);
       B01 = _mm_loadu_pd(&B[(k*N+j+1)*2]);
       B01_r = _mm_shuffle_pd(B01,B01,1);

       /*Perform A[i][k]*B[k][j], A[i][k]*B[k][j+1], A[i+1][k]*B[k][j], A[i+1][k]*B[k][j+1]  and assign it to op00,op01,op10,op11 respectively -> takes 8 _mm_mul_pd() and 4 _mm_addsub_pd()*/

       T1 = _mm_mul_pd(A00r,B00);
       T2 = _mm_mul_pd(A00i,B00_r);
       op00 = _mm_addsub_pd(T1,T2);

       T1 = _mm_mul_pd(A00r,B01);
       T2 = _mm_mul_pd(A00i,B01_r);
       op01 = _mm_addsub_pd(T1,T2);

       T1 = _mm_mul_pd(A10r,B00);
       T2 = _mm_mul_pd(A10i,B00_r);
       op10 = _mm_addsub_pd(T1,T2);

       T1 = _mm_mul_pd(A10r,B01);
       T2 = _mm_mul_pd(A10i,B01_r);
       op11 = _mm_addsub_pd(T1,T2);

       B10 = _mm_loadu_pd(&B[((k+1)*N+j)*2]);
       B10_r = _mm_shuffle_pd(B10,B10,1);
       B11 = _mm_loadu_pd(&B[((k+1)*N+j+1)*2]);
       B11_r = _mm_shuffle_pd(B11,B11,1);

      /*Perform A[i][k+1]*B[k+1][j],A[i][k+1]*B[k+1][j+1],A[i+1][k+1]*B[k+1][j],A[i+1][k+1]*B[k+1][j+1] and add it to op00,op01,op10,op11 respectively-> takes 8 _mm_mul_pd(), 4 _mm_add_pd(), 4 _mm_addsub_pd()*/

       T1 = _mm_mul_pd(A01r,B10);
       T2 = _mm_mul_pd(A01i,B10_r);
       op00 = _mm_add_pd(op00,_mm_addsub_pd(T1,T2));

       T1 = _mm_mul_pd(A01r,B11);
       T2 = _mm_mul_pd(A01i,B11_r);
       op01 = _mm_add_pd(op01,_mm_addsub_pd(T1,T2));

       T1 = _mm_mul_pd(A11r,B10);
       T2 = _mm_mul_pd(A11i,B10_r);
       op10 = _mm_add_pd(op10,_mm_addsub_pd(T1,T2));

       T1 = _mm_mul_pd(A11r,B11);
       T2 = _mm_mul_pd(A11i,B11_r);
       op11 = _mm_add_pd(op11,_mm_addsub_pd(T1,T2));

      /*Store op00,op01,op10,op11 into out[0],out[2],out[4] and out[6] -> 4 stores*/

       _mm_storeu_pd(&out[0],op00);
       _mm_storeu_pd(&out[2],op01);
       _mm_storeu_pd(&out[4],op10);
       _mm_storeu_pd(&out[6],op11);
      /*Perform the following 8 operations*/
      C[(i*N+j)*2+0] += out[0];
      C[(i*N+j)*2+1] += out[1];
      /* ... */
      C[((i+1)*N+j)*2+3] += out[7];
     }
  }
}
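One tweak worth trying (my suggestion, not part of the original code): drop the out[] staging buffer and the eight scalar additions, and accumulate into C with SSE directly. This fragment would replace the store/accumulate section inside the j loop, with i, j, N, C and op00..op11 as above:

/* Hypothetical replacement for the out[] staging and the eight
   scalar += updates: read-modify-write the 2*2 block of C with SSE. */
__m128d c00 = _mm_loadu_pd(&C[(i*N+j)*2+0]);
__m128d c01 = _mm_loadu_pd(&C[(i*N+j)*2+2]);
__m128d c10 = _mm_loadu_pd(&C[((i+1)*N+j)*2+0]);
__m128d c11 = _mm_loadu_pd(&C[((i+1)*N+j)*2+2]);
_mm_storeu_pd(&C[(i*N+j)*2+0],     _mm_add_pd(c00,op00));
_mm_storeu_pd(&C[(i*N+j)*2+2],     _mm_add_pd(c01,op01));
_mm_storeu_pd(&C[((i+1)*N+j)*2+0], _mm_add_pd(c10,op10));
_mm_storeu_pd(&C[((i+1)*N+j)*2+2], _mm_add_pd(c11,op11));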

The L1 cache is 32 KB, so I used cache blocking too (a 16*16 tile, which puts the working set at 12 KB: 3 matrices * 16 * 16 complex doubles * 16 bytes each), but it didn't help much. I'm only getting about 50% of the theoretical peak performance. Any pointers on how I could improve this?
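For comparison, here is a scalar reference of the tiled loop structure described above (the function name is mine; TB = 16, N assumed to be a multiple of TB, layout matching the indexing in the code):

enum { TB = 16 }; /* 3 * TB * TB * 16 bytes = 12 KB working set */

/* Scalar tiled ZGEMM reference: row-major, real/imag interleaved,
   C[i][j] += A[i][k]*B[k][j]. N must be a multiple of TB. */
void zgemm_tiled_ref(int N, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < N; ii += TB)
    for (int kk = 0; kk < N; kk += TB)
    for (int jj = 0; jj < N; jj += TB)
        for (int i = ii; i < ii + TB; i++)
        for (int k = kk; k < kk + TB; k++) {
            double ar = A[(i*N+k)*2+0], ai = A[(i*N+k)*2+1];
            for (int j = jj; j < jj + TB; j++) {
                double br = B[(k*N+j)*2+0], bi = B[(k*N+j)*2+1];
                C[(i*N+j)*2+0] += ar*br - ai*bi;
                C[(i*N+j)*2+1] += ar*bi + ai*br;
            }
        }
}

The flop count is unchanged; only the traversal order changes, which is what bounds the working set to three 16*16 tiles.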

user1715122
  • Where's the actual SSE computation? I see almost none. – Mysticial Feb 18 '13 at 03:39
  • @Mysticial: It is when I perform the multiplications. For example, if I perform A[i][k]*B[k][j], and I have stored the real and imaginary parts of A[i][k] in A_r and A_i, B_ri stores real and imaginary parts of B[k][j] contiguously and B_ri_r stores them in reverse order, I'll have to do: T1 = _mm_mul_pd(A_r,B_ri);T2 = _mm_mul_pd(A_i,B_ri_r);op00 = _mm_addsub_pd(T1,T2); – user1715122 Feb 18 '13 at 03:46
  • I guess that's the "almost none". Yeah, you simply have too little computation with respect to the amount of shuffling and data movement. – Mysticial Feb 18 '13 at 03:48
  • Edited to show the complete computations. It is surprising that this is memory-bound, since BLAS's zgemm reaches peak performance – user1715122 Feb 18 '13 at 04:09
  • @Mysticial: In the following post: http://stackoverflow.com/questions/8389648/how-to-achieve-4-flops-per-cycle where you mention that "Each of these 12 instructions blocks are completely independent from each other - and take on average 6 cycles to execute.", does it mean that the 6 _mm_mul_pd() all get executed within 5 cycles? – user1715122 Feb 18 '13 at 05:31
  • Not quite. Each one takes 5 cycles from start to finish. But you can start a new one every single cycle. So as long as you have enough independent multiplications ready to execute at all times, you can average 1 per cycle - even though each individual multiply takes 5 cycles to do. – Mysticial Feb 18 '13 at 05:35
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/24687/discussion-between-user1715122-and-mysticial) – user1715122 Feb 18 '13 at 08:01
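To make the latency-versus-throughput point from the comments concrete, here is a toy sketch (mine, not code from the thread): four independent accumulator chains let a new multiply issue every cycle, whereas a single chain would stall on the ~5-cycle multiply latency.

#include <emmintrin.h> /* SSE2 */

/* Toy dot product with four independent accumulators (n must be a
   multiple of 8). Each chain's add depends only on its own previous
   value, so several multiplies can be in flight at once. */
double dot_unrolled(const double *x, const double *y, long n)
{
    __m128d s0 = _mm_setzero_pd(), s1 = _mm_setzero_pd();
    __m128d s2 = _mm_setzero_pd(), s3 = _mm_setzero_pd();
    for (long i = 0; i < n; i += 8) {
        s0 = _mm_add_pd(s0, _mm_mul_pd(_mm_loadu_pd(x+i),   _mm_loadu_pd(y+i)));
        s1 = _mm_add_pd(s1, _mm_mul_pd(_mm_loadu_pd(x+i+2), _mm_loadu_pd(y+i+2)));
        s2 = _mm_add_pd(s2, _mm_mul_pd(_mm_loadu_pd(x+i+4), _mm_loadu_pd(y+i+4)));
        s3 = _mm_add_pd(s3, _mm_mul_pd(_mm_loadu_pd(x+i+6), _mm_loadu_pd(y+i+6)));
    }
    __m128d s = _mm_add_pd(_mm_add_pd(s0, s1), _mm_add_pd(s2, s3));
    double out[2];
    _mm_storeu_pd(out, s);
    return out[0] + out[1];
}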

0 Answers