How would you use matrix transposition to optimize this code for caches?
for (int i = 0; i < SIZE; i++) {
    for (int j = 0; j < SIZE; j++) {
        dest[i][j] = src[j][i];
    }
}
You have to know the machine architecture (cache-line size, cache sizes, number of cores) to do this properly. In general, though, you want to divide the work among N - 1 worker threads, where N is the number of threads available and one is set aside for the main manager thread. Break the memory each thread reads and writes into blocks that are aligned to, and sized in multiples of, the cache line, so that no two threads touch the same cache line and the threads don't fight on the memory bus over common-memory hits.
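As a rough single-threaded sketch of the blocking idea, the loop can be split into square tiles so that each tile's reads from src and writes to dest stay within a few cache lines. The values here are assumptions, not measurements: SIZE is set to 1000 for illustration, BLOCK is 16 on the guess that int is 4 bytes and a cache line is 64 bytes, and transpose_blocked is just a name made up for this example.

```c
#include <stddef.h>

#define SIZE  1000  /* assumed matrix dimension, for illustration only */
#define BLOCK 16    /* assumed tile edge: 16 ints = 64 bytes on many machines */

/* Transpose src into dest one BLOCK x BLOCK tile at a time.
 * Inside a tile, only BLOCK rows of src and BLOCK rows of dest are touched,
 * so the lines fetched for the strided src reads get reused before eviction. */
static void transpose_blocked(int dest[SIZE][SIZE], int src[SIZE][SIZE])
{
    for (size_t ii = 0; ii < SIZE; ii += BLOCK) {
        /* Clamp the tile edge so SIZE need not be a multiple of BLOCK. */
        size_t i_end = (ii + BLOCK < SIZE) ? ii + BLOCK : SIZE;

        for (size_t jj = 0; jj < SIZE; jj += BLOCK) {
            size_t j_end = (jj + BLOCK < SIZE) ? jj + BLOCK : SIZE;

            for (size_t i = ii; i < i_end; i++) {
                for (size_t j = jj; j < j_end; j++) {
                    dest[i][j] = src[j][i];
                }
            }
        }
    }
}
```

To spread this over worker threads, one option is to give each thread a contiguous range of the outer ii bands, since each band writes a disjoint set of dest rows. Whether those rows also start on cache-line boundaries depends on how dest is allocated and on SIZE, so that part still has to be checked against the actual machine.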