5

I am wondering if someone can help me following loop tiling optimization to minimize cache misses. I am working on four 4000x4000 float matrix multiplication.

The machine has three level caches with values: L1 128kb, L2 1Mb, and L3 8Mb

There are four nested loops and is not perfectly nested.

for(int i=0; i<NN; i++) {
    for (int j=0; j<NN; j++) {
        if (i != j){
            thirdlayer = 0;
            for (int k=0; k<NN; k++) {
                fourthlayer = 0;
                for (int l=0; l<NN; l++) {
                    fourthlayer =  fourthlayer + V[j*NN+l]*V[NN+l]*J[k*NN+l];
                }
                thirdlayer = thirdlayer + V[k]*V[i*NN+k]*fourthlayer;
            }
            if(pi_cod[j] != 0)
                Transitions[i*NN +j] =  sqrt(pi_cod[i]*pi_cod[1]/(pi_cod[0]*pi_cod[j]))*Q[i*NN +j]*thirdlayer/Padt;
        }
    }
}
ChrisWue
  • 18,612
  • 4
  • 58
  • 83
Adam Q
  • 51
  • 3
  • 2
    This is probably a better fit for codereview.stackexchange.com – ChrisWue Oct 03 '13 at 20:20
  • 3
    you are doing lots of redundant work. `V[j*NN+l]*V[NN+l]` can be computed way in the outer loop, etc, etc. I'd first look to reduce unnecessary computational overhead. – Anycorn Oct 03 '13 at 20:35
  • @Anycorn probably not, since `V[j*NN+l]*V[NN+l]` would seem to include the letter el not the number one. – Rian Rizvi Oct 03 '13 at 21:07
  • @RiazRizvi it doesn't depend on i or k - can be factored way out – Anycorn Oct 03 '13 at 21:23
  • @Anycorn Oh I see, `V[NN+l]` is a lookup from the constant address `&V+NN` so why not store this address in a local and then lookup using an offset l=0..NN-1? – Rian Rizvi Oct 03 '13 at 21:33
  • how to move V[j*NN+1]*V[NN+1] to outer loop since they are related to inner most loop l? – Adam Q Oct 07 '13 at 17:25
  • See related http://stackoverflow.com/questions/16699247/what-is-cache-friendly-code. Plus many similar posts in stackexchange. – Krazy Glew Oct 08 '13 at 19:36

0 Answers0