C++ OpenMP: nested loops where the inner iterator depends on the outer one

Question

Consider the following code:

#include <iostream>
#include <chrono>
#include <vector>
#include <numeric>
#include <cmath>
#include <omp.h>

using namespace std;

typedef std::chrono::steady_clock myclock;

double measure_time(myclock::time_point begin, myclock::time_point end)
{
    return std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count()/(double)1e6;
}

int main()
{
    int n = 20000;
    vector<double> v(n);
    iota(v.begin(), v.end(), 1.5);

    vector< vector<double> > D(n, vector<double>(n,0.0));

    myclock::time_point begin, end;

    begin = myclock::now();

    //#pragma omp parallel for collapse(2)
    //#pragma omp parallel for
    for(size_t i = 0; i < n - 1; i++){
        for(size_t j = i+1; j < n; j++){
            double d = sqrt(v[i]*v[i] + v[j]*v[j] + 1.5*v[i]*v[j]);
            D[i][j] = d;
            D[j][i] = d;
        }
    }

    end= myclock::now();
    double time = measure_time(begin, end);
    cout<<"Time: "<<time<<" (s)"<<endl;
    return 0;
}

For compiling:

g++ -std=c++11 -fopenmp -o main main.cpp

I obtained the following run time:

With #pragma omp parallel for collapse(2): 7.9425 (s)
With #pragma omp parallel for: 3.73262 (s)
Without OpenGM: 11.0935 (s)

System settings: Linux Mint 18.3 64-bit, g++ 5.4.0, quad-core processor.

I would expect the first to be faster than the second (which parallelizes only the outer loop) and much faster than the third.

What did I do wrong please? The first and the second both ran on all 8 threads.

Thank you in advance for your help!

Please provide a [mcve] as well as a detailed system description. It is impossible to provide a specific answer based on this code-snippet. — Zulan, Dec 12 '17 at 08:20
Collapse does not work with loops with dependencies https://stackoverflow.com/questions/28482833/understanding-the-collapse-clause-in-openmp/28483812#28483812 — Z boson, Dec 13 '17 at 12:10

score 5 · Accepted Answer · answered Dec 13 '17 at 12:43

The collapse clause should not be used when the iterations are depended on another loop. See Understanding the collapse clause in openmp.

In your case you are running over the lower triangle of a matrix (excluding the diagonal) because of symmetry. This cuts the number of iterations roughly in half. If you want to fuse/collapse the double loop you can do it by hand like this (see the end of this answer for more details).

for(size_t k=0; k<n*(n-1)/2; k++) {
    size_t i = k/n, j = k%n;
    if(j<=i) i = n - i - 2, j = n - j - 1;
    double d = sqrt(v[i]*v[i] + v[j]*v[j] + 1.5*v[i]*v[j]);
    D[i][j] = d;
    D[j][i] = d;
}

I think most people assume that collapsing a loop is going to give better performance but this is often not the case. In my experience most of the time there is no difference in performance but in some cases it's much worse due to cache issues. In a few cases it's better. You have to test yourself.

As to why your code was twice as slow with the collapse clause I can only guess that since the effect is unspecified for the inner loop that your OpenMP implementation ran over j from [0,n) i.e. the full matrix rather than half the matrix.

Nice. With your code it is a little bit faster: ~3.3s. Maybe it will scale better on more expensive tasks. Thank you very much! — f10w, Dec 14 '17 at 15:50

C++ OpenMP: nested loops where the inner iterator depends on the outer one

1 Answers1

Linked