I am researching how to speed up matrix operations (on double values) and have been trying techniques such as cache blocking and loop unrolling. Loop unrolling was really successful, but I cannot improve performance with blocking. I don't know whether I am doing something wrong or whether blocking simply isn't useful in this case.
The original code, without any optimization, is:
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            d[i][j] = 0.0;
        }
    }
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            for (int k = 0; k < K_MAX; k++) {
                d[i][j] += 2 * a[i][k] * (b[k][j] - c[k]);
            }
        }
    }
Here K_MAX is always 8, and N takes the values 250, 500, 750, 1000, 1500, 2000, 2550, and 3000.
This is what I tried with blocking:
    for (int i = 0; i < N; i += block_size) {
        for (int j = 0; j < N; j += block_size) {
            for (int ii = i; ii < min(i + block_size, N); ii++) {
                for (int jj = j; jj < min(j + block_size, N); jj++) {
                    d[ii][jj] = 0.0;
                    for (int k = 0; k < K_MAX; k++) {
                        d[ii][jj] += 2 * a[ii][k] * (b[k][jj] - c[k]);
                    }
                }
            }
        }
    }
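For anyone who wants to reproduce this, here is a minimal self-contained sketch of both versions together with a check that they agree. It is not my exact test code: the flat row-major `std::vector` storage, the function names (`compute_naive`, `compute_blocked`, `max_abs_diff`), and the fill values are assumptions made only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr int K_MAX = 8;
using Vec = std::vector<double>;

// Assumed layout (row-major, flat): a is N x K_MAX, b is K_MAX x N,
// c has K_MAX entries, d is N x N.
void compute_naive(int N, const Vec& a, const Vec& b, const Vec& c, Vec& d) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            const std::size_t out = static_cast<std::size_t>(i) * N + j;
            d[out] = 0.0;
            for (int k = 0; k < K_MAX; k++) {
                d[out] += 2 * a[static_cast<std::size_t>(i) * K_MAX + k]
                            * (b[static_cast<std::size_t>(k) * N + j] - c[k]);
            }
        }
    }
}

// Same computation, tiled over (i, j). std::min clamps the last tile,
// so N does not have to be a multiple of block_size.
void compute_blocked(int N, int block_size, const Vec& a, const Vec& b,
                     const Vec& c, Vec& d) {
    assert(block_size > 0);
    for (int i = 0; i < N; i += block_size) {
        const int i_end = std::min(i + block_size, N);
        for (int j = 0; j < N; j += block_size) {
            const int j_end = std::min(j + block_size, N);
            for (int ii = i; ii < i_end; ii++) {
                for (int jj = j; jj < j_end; jj++) {
                    const std::size_t out = static_cast<std::size_t>(ii) * N + jj;
                    d[out] = 0.0;
                    for (int k = 0; k < K_MAX; k++) {
                        d[out] += 2 * a[static_cast<std::size_t>(ii) * K_MAX + k]
                                    * (b[static_cast<std::size_t>(k) * N + jj] - c[k]);
                    }
                }
            }
        }
    }
}

// Fills the inputs with small, exactly representable values (hypothetical
// test data) and returns the largest absolute difference between the outputs.
double max_abs_diff(int N, int block_size) {
    Vec a(static_cast<std::size_t>(N) * K_MAX);
    Vec b(static_cast<std::size_t>(K_MAX) * N);
    Vec c(K_MAX);
    for (std::size_t x = 0; x < a.size(); x++) a[x] = static_cast<double>(x % 5) * 0.5;
    for (std::size_t x = 0; x < b.size(); x++) b[x] = static_cast<double>(x % 7) * 0.25;
    for (int k = 0; k < K_MAX; k++) c[k] = k * 0.25;

    Vec d_naive(static_cast<std::size_t>(N) * N);
    Vec d_blocked(d_naive.size());
    compute_naive(N, a, b, c, d_naive);
    compute_blocked(N, block_size, a, b, c, d_blocked);

    double worst = 0.0;
    for (std::size_t x = 0; x < d_naive.size(); x++) {
        worst = std::max(worst, std::fabs(d_naive[x] - d_blocked[x]));
    }
    return worst;
}
```

Calling, for example, `max_abs_diff(250, 64)` exercises a block size that does not divide N, so the clamped edge tiles are covered as well.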
I am probably choosing a bad value for block_size, because I do not understand how to pick a good one. I tried every divisor of N as the block size, from 1 to N. I also tried multiples of the number of elements that fit in a cache line (8 doubles), such as 8, 64, 128, 256, and 512. I know N is not always a multiple of those values, so the elements left over at the edges of the blocks have to be handled; I do handle them, and the outputs are correct. Still, performance did not improve. I also tried using the same block size for every N, but that did not help either.
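A sweep like the one described could be set up roughly as follows. This is only a sketch, not the exact harness behind my measurements: `divisors` and `time_blocked_ms` are hypothetical helper names, the arrays are assumed to be flat row-major `std::vector<double>`, the fill values are arbitrary, and a single timed run per block size is a simplification.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

constexpr int K_MAX = 8;

// All divisors of n in increasing order: the block sizes that tile N exactly.
std::vector<int> divisors(int n) {
    assert(n > 0);
    std::vector<int> result;
    for (int d = 1; d <= n; d++) {
        if (n % d == 0) result.push_back(d);
    }
    return result;
}

// Runs the blocked kernel once and returns the elapsed time in milliseconds.
// Input values are arbitrary placeholders; only the timing is of interest.
double time_blocked_ms(int N, int block_size) {
    assert(N > 0 && block_size > 0);
    const std::size_t n = static_cast<std::size_t>(N);
    std::vector<double> a(n * K_MAX, 1.0), b(K_MAX * n, 2.0), c(K_MAX, 0.5), d(n * n);

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < N; i += block_size) {
        const int i_end = std::min(i + block_size, N);
        for (int j = 0; j < N; j += block_size) {
            const int j_end = std::min(j + block_size, N);
            for (int ii = i; ii < i_end; ii++) {
                for (int jj = j; jj < j_end; jj++) {
                    d[ii * n + jj] = 0.0;
                    for (int k = 0; k < K_MAX; k++) {
                        d[ii * n + jj] += 2 * a[ii * K_MAX + k] * (b[k * n + jj] - c[k]);
                    }
                }
            }
        }
    }
    const auto stop = std::chrono::steady_clock::now();

    // Touch a result through a volatile so the optimizer is less likely
    // to discard the computation being timed.
    volatile double sink = d[d.size() - 1];
    (void)sink;
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Looping over `divisors(N)` (or over 8, 64, 128, 256, 512) and calling `time_blocked_ms(N, bs)` for each candidate gives one timing per block size; in practice each measurement would be repeated several times, keeping the minimum or median.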
My processor is an Intel Core i7-10870H.
Thank you in advance