I am doing some testing on cache-efficient algorithms for matrix transposition, and with the loop-blocking (tiling) technique I get a higher speedup when I use a smaller block size. Shouldn't a larger block size result in a higher speedup, since it reduces the number of memory accesses by bringing more data into the cache at once?
Here is the algorithm:
for (int i = 0; i < n; i += blocksize) {         /* block row start */
    for (int j = 0; j < n; j += blocksize) {     /* block column start */
        /* transpose one blocksize x blocksize tile */
        for (int k = i; k < i + blocksize; ++k) {
            for (int l = j; l < j + blocksize; ++l) {
                dst[k + l * n] = src[l + k * n];
            }
        }
    }
}
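
For reference, this is the kind of benchmark harness I am using to compare block sizes; it is only a minimal sketch (the wrapper function transpose_blocked, the 4096x4096 matrix size, the list of block sizes, and the use of clock() are just my choices for illustration, and n is assumed to be a multiple of blocksize):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Blocked (tiled) transpose: dst becomes the transpose of src, both n x n. */
static void transpose_blocked(double *dst, const double *src, int n, int blocksize) {
    for (int i = 0; i < n; i += blocksize)
        for (int j = 0; j < n; j += blocksize)
            for (int k = i; k < i + blocksize; ++k)
                for (int l = j; l < j + blocksize; ++l)
                    dst[k + l * n] = src[l + k * n];
}

int main(void) {
    const int n = 4096;  /* assumed: n is a multiple of every block size tested */
    double *src = malloc((size_t)n * n * sizeof *src);
    double *dst = malloc((size_t)n * n * sizeof *dst);
    if (!src || !dst) return 1;
    for (int i = 0; i < n * n; ++i) src[i] = (double)i;

    int sizes[] = {8, 16, 32, 64, 128, 256};
    for (size_t s = 0; s < sizeof sizes / sizeof sizes[0]; ++s) {
        clock_t t0 = clock();
        transpose_blocked(dst, src, n, sizes[s]);
        clock_t t1 = clock();
        printf("blocksize %4d: %.3f s\n", sizes[s],
               (double)(t1 - t0) / CLOCKS_PER_SEC);
    }

    free(src);
    free(dst);
    return 0;
}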