
I have this code that transposes a matrix using a loop tiling strategy.

void transposer(int n, int m, double *dst, const double *src) {
    int blocksize = 32; // tile size; assumed to divide n and m evenly
    for (int i = 0; i < n; i += blocksize) {
        for (int j = 0; j < m; j += blocksize) {
            // transpose the block beginning at [i,j]
            for (int k = i; k < i + blocksize; ++k) {
                for (int l = j; l < j + blocksize; ++l) {
                    dst[k + l*n] = src[l + k*m];
                }
            }
        }
    }
}

I want to optimize this with multi-threading using OpenMP; however, I am not sure what to do with so many nested for loops. I thought about just adding `#pragma omp parallel for`, but doesn't this just parallelize the outer loop?

  • Does this answer your question? [How does OpenMP handle nested loops?](https://stackoverflow.com/questions/13357065/how-does-openmp-handle-nested-loops) – Janez Kuhar Jan 22 '22 at 08:44
  • @JanezKuhar Somewhat; however, I was more wondering if any further optimizations could be done. The thread you linked only really talks about collapse, etc. – Marcus F Jan 22 '22 at 08:45

3 Answers


When you try to parallelize a loop nest, you should ask yourself how many levels are conflict-free, meaning that every iteration writes to a different location. If two iterations (potentially) write to the same location, you need to 1. use a reduction, 2. use a critical section or other synchronization, 3. decide that this loop is not worth parallelizing, or 4. rewrite your algorithm.
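For example (not needed for this transpose, where no write conflict exists), here is a minimal sketch of options 1 and 2 on a sum, whose accumulator is a shared write location:

#include <omp.h>

// Option 1: a reduction gives each thread a private partial sum and
// combines the partial sums at the end.
double sum_reduction(const double *a, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}

// Option 2: a critical section serializes the conflicting update.
// Correct, but usually much slower than the reduction.
double sum_critical(const double *a, int n) {
    double sum = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        #pragma omp critical
        sum += a[i];
    }
    return sum;
}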

In your case, the write location depends on k and l. Since 0 <= k < n, the write index k + l*n is distinct for every (k,l) pair: no two pairs (k,l) and (k',l') write to the same location. Furthermore, no two iterations of the inner nest see the same (k,l) pair. So all four loops are parallel, and they are perfectly nested, so you can use collapse(4).

You could also have drawn this conclusion by considering the algorithm in the abstract: in a matrix transposition each target location is written exactly once, so no matter how you traverse the target data structure, it's completely parallel.
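A sketch of what that looks like (the concrete blocksize value is an assumption for illustration; also note that the bounds of the k and l loops depend on i and j, making this a non-rectangular nest, which collapse only supports as of OpenMP 5.0):

void transposer(int n, int m, double *dst, const double *src) {
    const int blocksize = 32; // illustrative tile size; must divide n and m
    #pragma omp parallel for collapse(4)
    for (int i = 0; i < n; i += blocksize)
        for (int j = 0; j < m; j += blocksize)
            for (int k = i; k < i + blocksize; ++k)
                for (int l = j; l < j + blocksize; ++l)
                    dst[k + l*n] = src[l + k*m];
}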

Victor Eijkhout
  • @MarcusF "is there anything else": be sure to set `OMP_PROC_BIND=true`. This is certainly important on dual-socket designs, but I would set it by default. If you don't, your OS may migrate threads and you lose your cache locality. – Victor Eijkhout Jan 23 '22 at 23:27
  • Hmm, I'm trying to understand ```collapse``` fully and I've read that the loops can't be dependent on each other. But the two innermost for loops have ```k = i``` and ```l = j```; doesn't that mean they are dependent? Can you clarify with an example of data dependency? – Marcus F Jan 24 '22 at 09:04
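(For clarity on that last comment: the inner loop bounds depending on i and j is a dependence between loop bounds, which collapse can handle. A data dependence is when one iteration touches data that another iteration writes, as in this sketch:)

// Loop-carried data dependence: iteration i reads a[i-1], which the
// previous iteration wrote, so these iterations cannot run in parallel.
for (int i = 1; i < n; ++i)
    a[i] += a[i-1];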

You can use the `collapse` clause to parallelize over two loops.

#   pragma omp parallel for collapse(2) 
    for (int i = 0; i < n; i += blocksize) {
        for (int j = 0; j < m; j += blocksize) {
            // transpose the block beginning at [i,j]
            for (int k = i; k < i + blocksize; ++k) {
                for (int l = j; l < j + blocksize; ++l) {
                    dst[k + l*n] = src[l + k*m];
                }
            }
        }
    }

As a side note, I think you should swap the two innermost loops. Usually, when you have a choice between writing sequentially and reading sequentially, writing sequentially is more important for performance.
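A minimal sketch of that swap: with k innermost, consecutive iterations write adjacent elements of dst (stride 1), while the reads from src now stride by m.

for (int l = j; l < j + blocksize; ++l)
    for (int k = i; k < i + blocksize; ++k)
        dst[k + l*n] = src[l + k*m]; // dst index advances by 1 per iteration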

Homer512

I thought about just adding `#pragma omp parallel for`, but doesn't this just parallelize the outer loop?

Yes. To parallelize multiple nested loops one can use OpenMP's `collapse` clause. Bear in mind, however, that:

  • (As pointed out by Victor Eijkhout.) Even though this does not directly apply to your code snippet, for each additional loop to be parallelized one should typically reason about the race conditions that parallelization might introduce, e.g., different threads writing concurrently to the same `dst` position.

  • in some cases parallelizing nested loops may result in slower execution than parallelizing a single loop, because the implementation of the `collapse` clause uses more complex logic than simple loop parallelization to divide the iterations among threads, and that overhead can outweigh the gains it provides.

You should benchmark with a single parallel loop, then with two, and so on, and compare the results.

void transposer(int n, int m, double *dst, const double *src) {
    int blocksize = 32; // example tile size; assumed to divide n and m
    #pragma omp parallel for collapse(...) // choose the depth by benchmarking
    for (int i = 0; i < n; i += blocksize)
        for (int j = 0; j < m; j += blocksize)
            for (int k = i; k < i + blocksize; ++k)
                for (int l = j; l < j + blocksize; ++l)
                    dst[k + l*n] = src[l + k*m];
}
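For instance, a minimal timing harness for such a comparison, assuming the transposer above with a concrete collapse depth filled in (the matrix sizes are placeholders; compile with OpenMP enabled so that omp_get_wtime() is available):

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int n = 2048, m = 2048; // placeholder sizes
    double *src = malloc(sizeof(double) * n * m);
    double *dst = malloc(sizeof(double) * n * m);
    for (int i = 0; i < n * m; ++i) src[i] = (double)i;

    double t0 = omp_get_wtime();
    transposer(n, m, dst, src);
    double t1 = omp_get_wtime();
    printf("elapsed: %f s\n", t1 - t0);

    free(src); free(dst);
    return 0;
}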

Depending on the number of threads, the number of cores, and the size of the matrices, among other factors, running sequentially might actually be faster than the parallel versions. This is especially true because your code is not very CPU-intensive (i.e., `dst[k + l*n] = src[l + k*m];`).

dreamcrash