
Does the following code parallelize only the first (outer) loop, or does it parallelize the entire set of nested loops?

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
    {
        for (int j = 0; j < M; j++)
        {
            // do task(i, j)
        }
    }

I just want to confirm whether the above code parallelizes the entire set of nested for loops (so that one thread handles a single task(i,j) directly), or whether it parallelizes only the outer for loop (which would ensure that, for each parallel thread with loop index i, its inner loop is executed sequentially by that single thread, which is very important for me).

user0002128

2 Answers


The lines you have written will parallelize only the outer loop. To parallelize both loops, you need to add a collapse clause:

    #pragma omp parallel for collapse(2)
    for (int i = 0; i < N; i++)
    {
        for (int j = 0; j < M; j++)
        {
            // do task(i, j)
        }
    }

You may want to check the OpenMP 3.1 specification (sec. 2.5.1) for more details.
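
For completeness, here is a minimal compilable sketch of the collapsed version; the N and M values are placeholders chosen only for illustration, and you would compile with, e.g., gcc -fopenmp:

    #include <stdio.h>
    #include <omp.h>

    #define N 4  /* placeholder problem sizes, only for illustration */
    #define M 4

    int main(void)
    {
        /* collapse(2) merges the two loops into a single iteration
           space of N*M iterations, which is divided among the threads */
        #pragma omp parallel for collapse(2)
        for (int i = 0; i < N; i++)
        {
            for (int j = 0; j < M; j++)
            {
                printf("task(%d, %d) on thread %d\n", i, j, omp_get_thread_num());
            }
        }
        return 0;
    }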

Massimiliano

You will be able to understand this better with the following example; let's run it with two threads.

    #pragma omp parallel for num_threads(2)
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 3; j++) {
            printf("i = %d, j = %d, threadId = %d\n", i, j, omp_get_thread_num());
        }
    }

then the output will be something like this:

    i = 0, j = 0, threadId = 0
    i = 0, j = 1, threadId = 0
    i = 0, j = 2, threadId = 0
    i = 1, j = 0, threadId = 0
    i = 1, j = 1, threadId = 0
    i = 1, j = 2, threadId = 0
    i = 2, j = 0, threadId = 1
    i = 2, j = 1, threadId = 1
    i = 2, j = 2, threadId = 1

That means that when you add #pragma omp parallel for to the outermost for loop, the iterations of that loop (the index i) are divided among the threads. As you can see, whenever i is the same, the thread ID is the same as well.
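
Conceptually, with the default static schedule, this behaves roughly like the sketch below: each thread takes a contiguous block of i values and runs its inner loop serially. This is only a simplified illustration of the idea (here the blocks are simulated one after another), not the actual runtime algorithm:

    #include <stdio.h>

    int main(void)
    {
        /* simplified sketch of outer-loop parallelization: each "thread"
           tid gets a contiguous block of i values, and its inner j loop
           runs sequentially; not how the runtime really schedules */
        int nthreads = 2, n = 3, m = 3;
        int chunk = (n + nthreads - 1) / nthreads;  /* i values per thread */
        for (int tid = 0; tid < nthreads; tid++) {
            int lo = tid * chunk;
            int hi = (lo + chunk < n) ? lo + chunk : n;
            for (int i = lo; i < hi; i++)
                for (int j = 0; j < m; j++)
                    printf("i = %d, j = %d, threadId = %d\n", i, j, tid);
        }
        return 0;
    }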

Instead of that, we can parallelize over the combinations of i and j that the nested loops produce. In this example we have the following combinations:

    i = 0, j = 0
    i = 0, j = 1
    i = 0, j = 2
    i = 1, j = 0
    i = 1, j = 1
    i = 1, j = 2
    i = 2, j = 0
    i = 2, j = 1
    i = 2, j = 2

In order to parallelize the code combination-wise, we can add the collapse clause as follows.

    #pragma omp parallel for num_threads(2) collapse(2)
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 3; j++) {
            printf("i = %d, j = %d, threadId = %d\n", i, j, omp_get_thread_num());
        }
    }

then the output will be as follows (the order of the printed lines can vary between runs, since the threads run concurrently):

    i = 0, j = 0, threadId = 0
    i = 0, j = 1, threadId = 0
    i = 1, j = 2, threadId = 1
    i = 2, j = 0, threadId = 1
    i = 2, j = 1, threadId = 1
    i = 2, j = 2, threadId = 1
    i = 0, j = 2, threadId = 0
    i = 1, j = 0, threadId = 0
    i = 1, j = 1, threadId = 0

Then you can see that, unlike before, the same index i can appear on different threads (for example, i = 1, j = 2 runs on thread 1 while i = 1, j = 0 runs on thread 0). That means that in this scenario, the combinations of i and j are divided among the threads.
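
Equivalently, you can think of collapse(2) as hand-flattening the two loops into a single loop over all 3 * 3 = 9 combinations, recovering i and j with division and modulo. This is a sketch of the idea, not necessarily how the runtime implements it:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        /* hand-flattened equivalent of collapse(2): one loop over all
           nine (i, j) combinations, which OpenMP divides among threads */
        #pragma omp parallel for num_threads(2)
        for (int k = 0; k < 3 * 3; k++) {
            int i = k / 3;  /* recover the outer index */
            int j = k % 3;  /* recover the inner index */
            printf("i = %d, j = %d, threadId = %d\n", i, j, omp_get_thread_num());
        }
        return 0;
    }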

Erangad
  • With the loops properly nested, outer-loop parallelization is usually best. If the outer loop count is not large compared to the number of threads, the collapse mentioned above is a good method, provided the nested loops can be made eligible and it doesn't interfere with inner-loop optimizations such as SIMD vectorization. – tim18 May 22 '18 at 12:41
  • Not all loops are collapsible, due to data dependencies, so the answer in general is no: nested parallelism is not going to work for every loop. This is why people are moving more towards GPUs, where there are three levels of parallelism. – JimBamFeng Aug 05 '18 at 05:13