I am doing some testing on cache-efficient algorithms for matrix transposition, and with the loop-blocking (tiling) technique I get a higher speedup when I use a smaller block size. Shouldn't a larger block size result in a higher speedup, since it reduces the number of memory accesses by bringing more data into the cache at once?
Here is the algorithm:
for (int i = 0; i < n; i += blocksize) {         /* block row start */
    for (int j = 0; j < n; j += blocksize) {     /* block column start */
        /* transpose one blocksize x blocksize tile */
        for (int k = i; k < i + blocksize; ++k) {
            for (int l = j; l < j + blocksize; ++l) {
                dst[k + l * n] = src[l + k * n];
            }
        }
    }
}
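
For reference, this is the kind of benchmark harness I am using to compare block sizes; it is only a minimal sketch (the wrapper function transpose_blocked, the 4096x4096 matrix size, the list of block sizes, and the use of clock() are just my choices for illustration, and n is assumed to be a multiple of blocksize):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Blocked (tiled) transpose: dst becomes the transpose of src, both n x n. */
static void transpose_blocked(double *dst, const double *src, int n, int blocksize) {
    for (int i = 0; i < n; i += blocksize)
        for (int j = 0; j < n; j += blocksize)
            for (int k = i; k < i + blocksize; ++k)
                for (int l = j; l < j + blocksize; ++l)
                    dst[k + l * n] = src[l + k * n];
}

int main(void) {
    const int n = 4096;  /* assumed: n is a multiple of every block size tested */
    double *src = malloc((size_t)n * n * sizeof *src);
    double *dst = malloc((size_t)n * n * sizeof *dst);
    if (!src || !dst) return 1;
    for (int i = 0; i < n * n; ++i) src[i] = (double)i;

    int sizes[] = {8, 16, 32, 64, 128, 256};
    for (size_t s = 0; s < sizeof sizes / sizeof sizes[0]; ++s) {
        clock_t t0 = clock();
        transpose_blocked(dst, src, n, sizes[s]);
        clock_t t1 = clock();
        printf("blocksize %4d: %.3f s\n", sizes[s],
               (double)(t1 - t0) / CLOCKS_PER_SEC);
    }

    free(src);
    free(dst);
    return 0;
}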