
I am trying to speed up my MPI project by utilizing OpenMP. I have a dataset of 1000 2D points, and am using a brute force algorithm to find the minimum distance between any two points. When I try to split the thread of execution, however, it greatly hurts performance. How can I properly utilize OpenMP?

Here is my attempt:

double calcDistance(double input[][2], int start, int stop){
    double temp;
    //declare and initialize minimum
    double minimum = pow(((input[start+1][0]) - (input[start][0])), 2) + pow(((input[start+1][1]) - (input[start][1])), 2);
    minimum = sqrt(minimum);

    closestIndex1 = start;
    closestIndex2 = start+1;

    //Brute Force Algorithm to find minimum distance in dataset.
    #pragma omp parallel for
    for(int i = start; i < stop; i++){
        for(int j = start; j < stop; j++){
            temp = pow(((input[j][0]) - (input[i][0])), 2) + pow(((input[j][1]) - (input[i][1])), 2);
            temp = sqrt(temp);
            if(temp < minimum && i < j){
                minimum = temp;
                closestIndex1 = i;
                closestIndex2 = j;
            }//endif
        }//end j
    }//end i
    return minimum;
}

I have to say WOW. Thank you, that was incredibly helpful, and really cleared up a bunch of questions that I had. Again, thank you, gha.st.

  • Does the program give the right answer? Don't you need something like `#pragma omp critical` before `return minimum` to force it to wait until all threads are complete? It also seems like you need some private variables - does the program give the same answer every time it is run? – James King May 03 '14 at 03:22
  • Yes, the program runs fine, and the answers are correct and always the same. It's just that the performance suffers. With just MPI the program executes in ~.06-.08s. Adding OpenMP, ~.5-.8s. – pdf2e May 03 '14 at 03:34
  • I added some serial optimizations, giving a 5x speedup in the test case – danielschemmel May 03 '14 at 05:49

1 Answer


Analysis

First, it is pure luck that your program appears to work. You do indeed have a data race, which causes invalid results on my machine. Consider the following test harness for this post:

::std::cout << ::xtd::target_info() << "\n\n"; // [target os] [target architecture] with [compiler]

static const int count = 30000;
auto gen = ::std::bind(::std::normal_distribution<double>(0, 1000), ::std::mt19937_64(42));
std::unique_ptr<double[][2]> input(new double[count][2]);
for(size_t i = 0; i < count; ++i)
{
    input[i][0] = gen();
    input[i][1] = gen();
}

::xtd::stopwatch sw; // does what its name suggests
sw.start();
double minimum = calcDistance(input.get(), 0, count);
sw.stop();
::std::cout << minimum << "\n";
::std::cout << sw << "\n";
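
The xtd bits are my own utility library; for readers without it, a roughly equivalent harness using only the standard library might look like this (std::chrono::steady_clock standing in for xtd::stopwatch, and the target-info line omitted):

#include <chrono>
#include <cstddef>
#include <functional>
#include <iostream>
#include <memory>
#include <random>

// the function under test, defined as elsewhere in this post
double calcDistance(double input[][2], int start, int stop);

int main()
{
    static const int count = 30000;
    auto gen = ::std::bind(::std::normal_distribution<double>(0, 1000), ::std::mt19937_64(42));
    ::std::unique_ptr<double[][2]> input(new double[count][2]);
    for(::std::size_t i = 0; i < count; ++i)
    {
        input[i][0] = gen();
        input[i][1] = gen();
    }

    auto begin = ::std::chrono::steady_clock::now();
    double minimum = calcDistance(input.get(), 0, count);
    auto end = ::std::chrono::steady_clock::now();

    ::std::cout << minimum << "\n";
    ::std::cout << ::std::chrono::duration_cast<::std::chrono::milliseconds>(end - begin).count() << " ms\n";
}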

When executing your function with the omp pragma removed, its output is:

Windows x64 with icc 14.0

0.0559233
7045 ms

or

Windows x64 with msvc VS 2013 (18.00.21005)

0.0559233
7272 ms

When executing it with the omp pragma intact, its output is:

Windows x64 with icc 14.0

0.324085
675.9 ms

or

Windows x64 with msvc VS 2013 (18.00.21005)

0.0559233
4338 ms

Since the machine uses 24 threads (on 12 cores with HT enabled), the speedup is evident, but could possibly be better, at least for MSVC. The compiler that generates the faster program (ICC) also exposes the data race by giving wrong results that differ from run to run.
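
(If you want to check the default thread count on your own machine, a snippet like the following will do; this check is my addition and was not part of the measurements:)

#include <omp.h>
#include <iostream>

int main()
{
    // number of threads a parallel region will use by default
    ::std::cout << omp_get_max_threads() << " OpenMP threads\n";
}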

Note: I was also able to observe an incorrect result from MSVC when compiling a debug version for x86 with 10k iterations.

Proper usage of iteration-local variables

The temp variable in your code has a lifetime of one iteration of the innermost loop. By moving its scope to match its lifetime, we can eliminate one source of data races. I also took the liberty of removing two unused variables and changing the initialization of minimum to a constant:

double calcDistance(double input[][2], int start, int stop){
    double minimum = ::std::numeric_limits<double>::infinity();
    //#pragma omp parallel for // still broken
    for(int i = start; i < stop; i++){
        for(int j = start; j < stop; j++) {
            double temp = pow(((input[j][0]) - (input[i][0])), 2) + pow(((input[j][1]) - (input[i][1])), 2);
            temp = sqrt(temp);
            if(temp < minimum && i < j) minimum = temp;
        }
    }
    return minimum;
}

A proper OMP minimum computation

OMP has support for reductions, which will in all probability perform fairly well. To try it, we will use the following pragma, which gives each thread its own copy of the minimum variable; the copies are combined using the minimum operator:

#pragma omp parallel for reduction(min: minimum)
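
For clarity, here is the whole function with the reduction clause applied; the body is unchanged from the corrected version above, only the pragma differs (this expanded listing is mine, not part of the original answer):

double calcDistance(double input[][2], int start, int stop){
    double minimum = ::std::numeric_limits<double>::infinity();
    // each thread reduces into a private copy of minimum; OpenMP combines them with min
    #pragma omp parallel for reduction(min: minimum)
    for(int i = start; i < stop; i++){
        for(int j = start; j < stop; j++) {
            double temp = pow(((input[j][0]) - (input[i][0])), 2) + pow(((input[j][1]) - (input[i][1])), 2);
            temp = sqrt(temp);
            if(temp < minimum && i < j) minimum = temp;
        }
    }
    return minimum;
}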

The results validate the approach for ICC:

Windows x64 with icc 14.0

0.0559233
622.1 ms

But MSVC howls error C3036: 'min' : invalid operator token in OpenMP 'reduction' clause, because it does not support minimum reductions. To define our own reduction, we will use a technique called double-checked locking:

double calcDistance(double input[][2], int start, int stop){
    double minimum = ::std::numeric_limits<double>::infinity();
    #pragma omp parallel for
    for(int i = start; i < stop; i++){
        for(int j = start; j < stop; j++) {
            double temp = pow(((input[j][0]) - (input[i][0])), 2) + pow(((input[j][1]) - (input[i][1])), 2);
            temp = sqrt(temp);
            if(temp < minimum && i < j)
            {
                #pragma omp critical
                if(temp < minimum && i < j) minimum = temp;
            }
        }
    }
    return minimum;
}

This is not only correct, but also leads to comparable performance for MSVC (note that this is significantly faster than the incorrect version!):

Windows x64 with msvc VS 2013 (18.00.21005)

0.0559233
653.1 ms

The performance of ICC does not suffer significantly:

Windows x64 with icc 14.0

0.0559233
636.8 ms

Serial optimizations

While the above is a proper parallelization of your serial code, it can be significantly optimized by considering that you are computing a whole bunch of temp results you are never going to use due to your i < j condition.

By simply starting the inner loop at j = i + 1, this computation is avoided entirely, and the i < j check in the loop body becomes unnecessary.

Another trick we use is delaying the sqrt computation until the last possible moment: since sqrt is monotonically increasing, we can compare squared distances instead and take the square root only once, at the end.

Finally, calling pow for a square is fairly inefficient, since it incurs a ton of overhead that we do not need.

This leads to the final code:

double calcDistance(double input[][2], int start, int stop){
    double minimum = ::std::numeric_limits<double>::infinity();
    #pragma omp parallel for
    for(int i = start; i < stop; i++) {
        for(int j = i + 1; j < stop; j++) {
            double dx = input[j][0] - input[i][0];
            dx *= dx;
            double dy = input[j][1] - input[i][1];
            dy *= dy;
            double temp = dx + dy;
            if(temp < minimum)
            {
                #pragma omp critical
                if(temp < minimum) minimum = temp;
            }
        }
    }
    return sqrt(minimum);
}

Leading to the final performance:

Windows x64 with icc 14.0

0.0559233
132.7 ms

or

Windows x64 with msvc VS 2013 (18.00.21005)

0.0559233
120.1 ms
danielschemmel
  • +1 for teaching me about double-checked locking. That's interesting! However, I don't think it's the most efficient way to do this reduction. In the worst case each thread would still have to wait every iteration (making it effectively serial). A more efficient method is to declare a private version of min for each thread and then merge them in a critical section (see the sketch after these comments). – Z boson May 04 '14 at 15:16
    @Zboson Correct, in fact, that is similar to what the `reduction (min: minimum)` solution does internally. In the average case, this simple variant works almost as well with a lot less effort however (as can be seen from the timings). – danielschemmel May 05 '14 at 08:26
  • I think the double-checked locking only works if writes to `minimum` are atomic. In that case, you don't need the `critical` pragma. – Patrick May 06 '14 at 16:12
  • @Patrick Yes, atomic writes are [guaranteed](http://stackoverflow.com/questions/1292786/is-updating-double-operation-atomic) and necessary for double checked locking. The `critical` pragma however is needed to guarantee that the second check is valid for the write. One could use [`omp atomic`](https://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/cref_cls/common/cppref_pragma_omp_atomic.htm) to ensure the atomicity of the write or a [CAS](http://msdn.microsoft.com/en-us/library/windows/desktop/ms683562.aspx) instead of the critical section. – danielschemmel May 06 '14 at 17:01
  • @gha.st Yes, you're right about the `critical` pragma. Thank you for the explanation. – Patrick May 06 '14 at 17:10
  • I didn't understand why the statement `if(temp < minimum)` appears twice. Is there a good reason for this? – BRabbit27 Jun 10 '15 at 15:59
  • @BRabbit27 As explained in the answer, it is a technique called [double checked locking](http://en.wikipedia.org/wiki/Double-checked_locking). The basic idea is to do an optimistic check without locking, and only lock (enter the critical section) if there is a **chance** the update might succeed. Inside the critical section, of course, it still has to be checked whether it **truly** is OK to pass the guard. – danielschemmel Jun 10 '15 at 19:27
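
For reference, here is a minimal sketch of the per-thread-minimum approach discussed in these comments, written against the final serially optimized loop (this sketch illustrates the comment thread and is not code from the answer itself): each thread tracks its own localMin and enters the critical section exactly once, at the end.

double calcDistance(double input[][2], int start, int stop){
    double minimum = ::std::numeric_limits<double>::infinity();
    #pragma omp parallel
    {
        // each thread reduces into its own private minimum...
        double localMin = ::std::numeric_limits<double>::infinity();
        #pragma omp for nowait
        for(int i = start; i < stop; i++) {
            for(int j = i + 1; j < stop; j++) {
                double dx = input[j][0] - input[i][0];
                double dy = input[j][1] - input[i][1];
                double temp = dx * dx + dy * dy;
                if(temp < localMin) localMin = temp;
            }
        }
        // ...and merges it into the shared result exactly once per thread
        #pragma omp critical
        if(localMin < minimum) minimum = localMin;
    }
    return sqrt(minimum);
}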