I thought about just adding #pragma omp parallel for but doesn't this just parallelize the outer loop?
Yes. To parallelize multiple nested loops you can use OpenMP's collapse clause. Bear in mind, however, that:
1. (As pointed out by Victor Eijkhout.) Even though it does not directly apply to your code snippet, for each additional loop that you parallelize you should reason about the new race conditions that the parallelization might introduce, for example different threads writing concurrently into the same dst position. A sketch of such a race follows this list.
2. In some cases, parallelizing nested loops can result in slower execution than parallelizing a single loop: the implementation of the collapse clause uses more complex logic than simple loop parallelization to divide the iterations of the fused loops among threads, and that extra bookkeeping can cost more than the parallelism gains.
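To illustrate point 1, here is a minimal sketch of my own (a hypothetical row_sums function, not code from your question): naively putting parallel for on the inner loop of a row sum makes every thread update the same accumulator, which is exactly this kind of race; OpenMP's reduction clause is one way to fix it.

// Hypothetical sketch: summing the rows of a rows x cols matrix.
void row_sums(int rows, int cols, const double *a, double *out) {
    for (int i = 0; i < rows; ++i) {
        double sum = 0.0;
        // Without the reduction clause, all threads would add into the
        // shared variable sum concurrently -- a data race. With
        // reduction(+:sum), each thread gets a private copy of sum and
        // the copies are combined once the loop finishes.
        #pragma omp parallel for reduction(+:sum)
        for (int j = 0; j < cols; ++j)
            sum += a[i * cols + j];
        out[i] = sum;
    }
}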
You should benchmark with a single parallel loop, then with two collapsed loops, and so on, and compare the results; a timing sketch is given at the end of this answer.
void transposer(int n, int m, double *dst, const double *src) {
    const int blocksize = 16;  // example value; tune for your cache (it was left uninitialized before)
    // collapse(2) fuses the two blocked loops into a single iteration
    // space that OpenMP divides among the threads; benchmark other
    // collapse depths as discussed above.
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; i += blocksize)
        for (int j = 0; j < m; j += blocksize)
            // Transpose one blocksize x blocksize tile; the extra bound
            // checks handle n or m not being multiples of blocksize.
            for (int k = i; k < i + blocksize && k < n; ++k)
                for (int l = j; l < j + blocksize && l < m; ++l)
                    dst[k + l*n] = src[l + k*m];
}
Depending on the number of threads, the number of cores, and the size of the matrices, among other factors, the sequential version may actually be faster than the parallel versions. This is especially true for your code, which is not very CPU-intensive: the kernel is a single assignment (dst[k + l*n] = src[l + k*m];), so it is bound by memory bandwidth rather than computation.
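As a starting point for such a benchmark, here is a minimal sketch of my own. The matrix sizes, the block size BS, and the function names transposer_outer and transposer_collapse2 are all assumptions for illustration; variant A parallelizes only the outer blocked loop, variant B uses the collapse(2) version shown above.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define BS 16  // example block size; tune for your cache

// Variant A: parallelize the outer blocked loop only.
void transposer_outer(int n, int m, double *dst, const double *src) {
    #pragma omp parallel for
    for (int i = 0; i < n; i += BS)
        for (int j = 0; j < m; j += BS)
            for (int k = i; k < i + BS && k < n; ++k)
                for (int l = j; l < j + BS && l < m; ++l)
                    dst[k + l*n] = src[l + k*m];
}

// Variant B: collapse the two blocked loops, as in the code above.
void transposer_collapse2(int n, int m, double *dst, const double *src) {
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; i += BS)
        for (int j = 0; j < m; j += BS)
            for (int k = i; k < i + BS && k < n; ++k)
                for (int l = j; l < j + BS && l < m; ++l)
                    dst[k + l*n] = src[l + k*m];
}

int main(void) {
    const int n = 4096, m = 4096;  // assumed sizes; adjust to your case
    double *src = malloc(sizeof(double) * (size_t)n * m);
    double *dst = malloc(sizeof(double) * (size_t)n * m);
    for (long i = 0; i < (long)n * m; ++i) src[i] = (double)i;

    // Single runs are noisy: in a real benchmark, repeat each variant
    // several times and take the best or the average.
    double t0 = omp_get_wtime();
    transposer_outer(n, m, dst, src);
    double t1 = omp_get_wtime();
    transposer_collapse2(n, m, dst, src);
    double t2 = omp_get_wtime();

    printf("outer loop only: %f s\n", t1 - t0);
    printf("collapse(2):     %f s\n", t2 - t1);

    free(src);
    free(dst);
    return 0;
}

Compile with something like gcc -O2 -fopenmp and run it under different OMP_NUM_THREADS settings; OMP_NUM_THREADS=1 gives you a near-sequential baseline to check whether the parallel versions pay off at all.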