
I am doing some research on how to improve matrix operations (using the `double` type), and I have been trying techniques such as cache blocking and loop unrolling. The second one was really successful, but I cannot improve my performance with blocking. I don't know if I am doing something wrong or if blocking is simply not useful in this case.

The original code, without any optimization, is:

    for (int i=0; i<N; i++){
        for (int j=0; j<N; j++){
            d[i][j] = 0.0;
        }
    }

    for (int i=0; i<N; i++){
        for (int j=0; j<N; j++){
            for (int k=0; k<K_MAX; k++){
                d[i][j] += 2 * a[i][k] * (b[k][j] - c[k]);
            }   
        }
    }

Where `K_MAX` is always 8 and `N` takes the values 250, 500, 750, 1000, 1500, 2000, 2550, and 3000.
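For reference, since `K_MAX` is fixed at 8, the unrolling I mean is fully flattening the inner `k` loop, roughly like this (a sketch, not my exact code; the name `compute_unrolled` and passing `N` as a runtime parameter are just to keep the example self-contained):

```c
#include <stddef.h>

#define K_MAX 8

/* Sketch of the loop unrolling: since K_MAX is always 8, the k loop can
   be fully flattened. The terms are added in the same left-to-right
   order as the original accumulation, so the result is unchanged. */
void compute_unrolled(int n, double (*d)[n], double (*a)[K_MAX],
                      double (*b)[n], const double *c) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            d[i][j] = 2 * a[i][0] * (b[0][j] - c[0])
                    + 2 * a[i][1] * (b[1][j] - c[1])
                    + 2 * a[i][2] * (b[2][j] - c[2])
                    + 2 * a[i][3] * (b[3][j] - c[3])
                    + 2 * a[i][4] * (b[4][j] - c[4])
                    + 2 * a[i][5] * (b[5][j] - c[5])
                    + 2 * a[i][6] * (b[6][j] - c[6])
                    + 2 * a[i][7] * (b[7][j] - c[7]);
        }
    }
}
```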

And what I was trying to do with blocking was:

    for (int i=0; i<N; i+=block_size){
        for (int j=0; j<N; j+=block_size){
            for (int ii=i; ii<min(i+block_size, N); ii++){
                for (int jj=j; jj<min(j+block_size, N); jj++){
                    d[ii][jj] = 0.0;
                    for (int k=0; k<K_MAX; k++){
                        d[ii][jj] += 2 * a[ii][k] * (b[k][jj] - c[k]);
                    }
                }
            }
        }
    }

I'm probably choosing a bad value for `block_size` because I did not understand how to pick a good one. I tried every divisor of `N` as the block size, from 1 to `N`. I also tried multiples of the number of doubles that fit in a cache line (8), such as 8, 64, 128, 256, and 512 (I know `N` is not always a multiple of those values, so the elements the blocks don't cover have to be handled separately; I did that and got correct outputs). I even tried using the same block size for all the values of `N`, but as you can guess, nothing improved.
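To make the block-size reasoning concrete, this is roughly the rule of thumb I was trying to apply (the 32 KiB L1d size of the i7-10870H, the "use half the cache" factor, and the helper name `pick_block_size` are my assumptions, not something I benchmarked):

```c
#include <stddef.h>

/* Rule-of-thumb sketch: grow the block until the data one B x B block
   touches (a block of d plus the matching slices of a and b, i.e. about
   B*(B + 2*K_MAX) doubles) would overflow roughly half the L1 data
   cache. The i7-10870H (Comet Lake) has 32 KiB of L1d per core; the
   "half" is a guess to leave room for c, the stack, etc. */
int pick_block_size(size_t l1_bytes, int k_max) {
    int B = 8;  /* start at one 64-byte cache line of doubles */
    while ((size_t)(2 * B) * (2 * B + 2 * k_max) * sizeof(double)
           <= l1_bytes / 2)
        B *= 2; /* doubling keeps B a multiple of the cache line */
    return B;
}
```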

My processor is an Intel Core i7-10870H.

Thank you in advance

dipese
  • You're trying to optimize access to `d` but the cache unfriendly access is for `b`. Your modified code doesn't address that [AFAICT]. Also, the compiler optimizer _might_ realize that for `d[i][j]`, `i` and `j` are invariant in the inner `k` loop, but you could help with: `double dc = d[i][j]; for (k = ....) dc += ...; d[i][j] = dc;` – Craig Estey May 05 '22 at 19:02
  • You _might_ have better luck by transposing `b` in a separate pass to produce `t`. Then, whenever you have `b[k][j]` you use `t[j][k]`. A web search on: `cache friendly matrix transpose` produces: https://stackoverflow.com/questions/5200338/a-cache-efficient-matrix-transpose-program amongst others. At the very least, this should help with your block design. – Craig Estey May 05 '22 at 19:08
  • Again, hopefully, the compiler optimizer realizes that it can eliminate a multiply when accessing `a[i][k]` because `i` is invariant across the `k` loop. You can help it with: `double *a2 = a[i]; for (k = ...) dc += a2[k] + ...;` – Craig Estey May 05 '22 at 19:13
  • I've done similar optimizations before. See my answer: https://stackoverflow.com/questions/39403215/is-it-possible-to-parallelize-this-for-loop/39404247#39404247 – Craig Estey May 05 '22 at 19:19
  • As I reviewed my previous [linked] answer, I have a question. Depending upon what else you do with `b`, could you read in the data (i.e. populate `b`) as a transposed matrix? Then, there is no need to create `t` [as above] because `b` is already in cache friendly order. – Craig Estey May 05 '22 at 19:25
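Putting the suggestions from the comments together (transposing `b` once into a scratch matrix `t`, keeping the `d[i][j]` accumulator in a local, and hoisting `a[i]` to a row pointer), I understand the proposed shape to be roughly the following sketch; `compute_transposed` and `t` are names made up for illustration:

```c
#include <stddef.h>

#define K_MAX 8

/* Sketch of the comments' suggestions: b is transposed once so the
   inner k loop reads t[j][k] sequentially instead of walking down a
   column of b, the accumulator for d[i][j] lives in a local, and a[i]
   is hoisted to a row pointer outside the j loop. */
void compute_transposed(int n, double (*d)[n], double (*a)[K_MAX],
                        double (*b)[n], double (*t)[K_MAX],
                        const double *c) {
    for (int k = 0; k < K_MAX; k++)      /* one pass: t = transpose(b) */
        for (int j = 0; j < n; j++)
            t[j][k] = b[k][j];

    for (int i = 0; i < n; i++) {
        const double *ai = a[i];         /* row pointer for a */
        for (int j = 0; j < n; j++) {
            const double *tj = t[j];     /* contiguous k-run, was a column of b */
            double acc = 0.0;            /* local accumulator for d[i][j] */
            for (int k = 0; k < K_MAX; k++)
                acc += 2 * ai[k] * (tj[k] - c[k]);
            d[i][j] = acc;
        }
    }
}
```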

0 Answers