43

I came across an OpenMP code that had the collapse clause, which was new to me. I'm trying to understand what it means, but I don't think I have fully grasped it's implications; One definition that I found is:

COLLAPSE: Specifies how many loops in a nested loop should be collapsed into one large iteration space and divided according to the schedule clause. The sequential execution of the iterations in all associated loops determines the order of the iterations in the collapsed iteration space.

I thought I understood what that meant, so I tried the follwoing simple program:

int i, j;
#pragma omp parallel for num_threads(2) private(j)
for (i = 0; i < 4; i++)
    for (j = 0; j <= i; j++)
        printf("%d %d %d\n", i, j, omp_get_thread_num());

Which produced

0 0 0
1 0 0
1 1 0
2 0 0
2 1 0
2 2 1
3 0 1
3 1 1
3 2 1
3 3 1

I then added the collapse(2) clause. I expected to have the same result in the first two columns but now have an equal number of 0's and 1's in the last column. But I got

0 0 0
1 0 0
2 0 1
3 0 1

So my questions are:

  1. What is happening in my code?
  2. Under what circumstances should I use collapse?
  3. Can you provide an example that shows the difference between using collapse and not using it?
fuz
  • 88,405
  • 25
  • 200
  • 352
iomartin
  • 3,149
  • 9
  • 31
  • 47
  • Good question. You're trying to fuse a triangular double loop. I don't think collapse works for that. It needs to be a square double loop. [Others on SO have said collapse works with triangular loops](https://stackoverflow.com/questions/23950056/how-to-parallel-nested-loop-to-find-the-nearest-two-point-in-openmp). I have not read the specification. If you want to fuse a triangular loop then see this [question](https://stackoverflow.com/questions/24013832/fusing-a-triangle-loop-for-parallelization-calculating-sub-indices). Although, I know a better way to do that now using induction variables. – Z boson Feb 12 '15 at 16:42
  • But if it's a square double loop, what is the benefit of using collapse? Each thread will get the same number of iterations either way. – iomartin Feb 12 '15 at 16:47
  • If you have two nested loops over `n` and `m` before you collapse each thread gets `n/nthreads` iterations whereas after you collapse it's `n*m` iterations. This can help e.g. when `n` is not very large relative to `nthreads` but `n*m` is. – Z boson Feb 12 '15 at 16:53
  • If you use C99, it saves you the trouble of privatizing your loop indices... #pragma omp parallel for for (int i = 0; i < 4; i++) for (int j = 0; j <= i; j++) printf("%d %d %d\n", i, j, omp_get_thread_num()); – Jeff Hammond Feb 12 '15 at 23:07
  • Current un-collapsed output is incorrect and shows 5 outputs for each thread -- should only be outer loop values 0 and 2 for thread #0 (i.e. 0 0 0, 2 0 0, 2 1 0) the other outputs should be with thread #1. – ofer.sheffer Apr 12 '17 at 15:53
  • @iomartin: The first output that you have shown is incorrect. It should be: {0 0 0}, {1 0 0}, {1 1 0}, {2 0 1}, {2 1 1}, {2 2 1}, {3 0 1}, {3 1 1}, {3 2 1}, {3 3 1}. I am surprised nobody pointed it out. – Gaurav Saxena Aug 06 '17 at 14:39

2 Answers2

54

The problem with your code is that the iterations of the inner loop depend on the outer loop. According to the OpenMP specification under the description of the section on binding and the collapse clause:

If execution of any associated loop changes any of the values used to compute any of the iteration counts, then the behavior is unspecified.

You can use collapse when this is not the case for example with a square loop

#pragma omp parallel for private(j) collapse(2)
for (i = 0; i < 4; i++)
    for (j = 0; j < 100; j++)

In fact this is a good example to show when to use collapse. The outer loop only has four iterations. If you have more than four threads then some will be wasted. But when you collapse the threads will distribute among 400 iterations which is likely to be much greater than the number of threads. Another reason to use collapse is if the load is not well distributed. If you only used four iterations and the fourth iteration took most of the time the other threads wait. But if you use 400 iterations the load is likely to be better distributed.

You can fuse a loop by hand for the code above like this

#pragma omp parallel for
for(int n=0; n<4*100; n++) {
    int i = n/100; int j=n%100;

Here is an example showing how to fuse a triply fused loop by hand.

Finally, here is an example showing how to fuse a triangular loop which collapse is not defined for.


Here is a solution that maps a rectangular loop to the triangular loop in the OPs question. This can be used to fuse the OPs triangular loop.

//int n = 4;
for(int k=0; k<n*(n+1)/2; k++) {
    int i = k/(n+1), j = k%(n+1);
    if(j>i) i = n - i -1, j = n - j;
    printf("(%d,%d)\n", i,j);
}

This works for any value of n.

The map for the OPs question goes from

(0,0),
(1,0), (1,1),
(2,0), (2,1), (2,2),
(3,0), (3,1), (3,2), (3,3),

to

(0,0), (3,3), (3,2), (3,1), (3,0),
(1,0), (1,1), (2,2), (2,1), (2,0),

For odd values of n the map is not exactly a rectangle but the formula still works.

For example n = 3 gets mapped from

(0,0),
(1,0), (1,1),
(2,0), (2,1), (2,2),

to

(0,0), (2,2), (2,1), (2,0),
(1,0), (1,1),

Here is code to test this

#include <stdio.h>
int main(void) {
    int n = 4;
    for(int i=0; i<n; i++) {
        for(int j=0; j<=i; j++) {
            printf("(%d,%d)\n", i,j);
        }
    }
    puts("");
    for(int k=0; k<n*(n+1)/2; k++) {
        int i = k/(n+1), j = k%(n+1);
        if(j>i) i = n - i - 1, j = n - j;
        printf("(%d,%d)\n", i,j);
    }
}
Gilles
  • 9,269
  • 4
  • 34
  • 53
Z boson
  • 32,619
  • 11
  • 123
  • 226
  • @Gilles, why did you add the comment `` to my answer? What's the point of doing that. I am not complaining. I just don't know what it's for. – Z boson Dec 03 '15 at 21:28
  • 3
    I just added the C syntax highlighting hint as described [here](http://meta.stackexchange.com/questions/184108/what-is-syntax-highlighting-and-how-does-it-work/184109#184109). Indeed, on my browser, all your code snippets were displayed a grim grey. Now, on my browser at least, but I guess on many others' too, the C syntax is coloured. OK, the indexes in the output snippets are too, which might be unwanted, but it can be fixed if you want? Anyway, I didn't want to intrude, but I thought a good answer deserves good colours... Did I go too far? – Gilles Dec 04 '15 at 05:01
  • 1
    @Gilles, I was not aware of that. Thank you! I don't mind at all that you improved my answer. – Z boson Dec 04 '15 at 08:16
  • But I didn't get what is the parameter mean? collapse(2) what is 2 ?! – N0rA Mar 05 '16 at 16:04
  • 2
    @N0rA The number of loops. `collapse(n)` collapses the following `n` nested loops into a single parallel loop shared by the threads. – Chris Swierczewski Jul 24 '17 at 20:35
0

If your purpose is balancing the load over increasing rows, assuming the workload for each item is regular or well scattered, then how about folding the row indices in half, and forgetting about the collapse clause?

#pragma omp for
for (int iy0=0; iy0<n; ++iy0){
  int iy = iy0;
  if (iy0 >= n/2) iy = n-1 -iy0 +n/2;
  for (int ix=iy+1; ix<n; ++ix){
    work(ix, iy);
  }
}
h2kyeong
  • 447
  • 3
  • 13