It seems to me that you misunderstood what the OpenMP ordered clause actually does. From the OpenMP Standard one can read:
The ordered construct either specifies a structured block in a
worksharing-loop, simd, or worksharing-loop SIMD region that will be
executed in the order of the loop iterations, or it is a stand-alone
directive that specifies cross-iteration dependences in a doacross
loop nest. The ordered construct sequentializes and orders the
execution of ordered regions while allowing code outside the region to
run in parallel.
or more informally:
The ordered clause works like this: different threads execute
concurrently until they encounter the ordered region, which is then
executed sequentially in the same order as it would get executed in a
serial loop.
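To make the difference concrete, here is a minimal sketch (expensive_work and n are just placeholders, not from your code): the work before the ordered region runs concurrently, while the ordered region itself is executed one iteration at a time, in loop order:
#pragma omp parallel for ordered schedule(dynamic, 1)
for(int i = 0; i < n; i++){
    int value = expensive_work(i);              // runs concurrently across threads
    #pragma omp ordered
    printf("iteration %d -> %d\n", i, value);   // executed sequentially, in iteration order
}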
Based on the way you have used it, it seems that you have mistaken the ordered clause for the OpenMP critical construct:
The critical construct restricts execution of the associated
structured block to a single thread at a time.
Therefore, with the ordered clause your code is essentially running sequentially, with the added overhead of the parallelism. However, even if you had used the critical construct instead, the overhead would still be too high, since threads would be locking on every loop iteration.
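For comparison, a minimal sketch of a critical region (compute and n are placeholders again); the block runs one thread at a time, but in no particular order, and the lock is acquired on every single iteration:
int total = 0;
#pragma omp parallel for
for(int i = 0; i < n; i++){
    int partial = compute(i);   // runs concurrently
    #pragma omp critical
    total += partial;           // one thread at a time, in arbitrary order
}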
At first glance, for the first loop you could use the OpenMP reduction clause (i.e., reduction(max : maxVal)), about which one can read in the standard:
The reduction clause can be used to perform some forms of recurrence
calculations (...) in parallel. For parallel and work-sharing
constructs, a private copy of each list item is created, one for each
implicit task, as if the private clause had been used. (...) The
private copy is then initialized as specified above. At the end of the
region for which the reduction clause was specified, the original list
item is updated by combining its original value with the final value
of each of the private copies, using the combiner of the specified
reduction-identifier.
For a more detailed explanation of how the reduction clause works, have a look at this SO thread.
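For the maximum alone, a reduction-based version could look like the sketch below (assuming maxVal is an int): each thread works on a private copy of maxVal, and the private copies are combined with max at the end of the loop:
int maxVal = INT32_MIN;
#pragma omp parallel for reduction(max : maxVal)
for(int i = 0; i < openSet.size(); i++){
    if(fScores[openSet[i].x * dim + openSet[i].y] > maxVal)
        maxVal = fScores[openSet[i].x * dim + openSet[i].y];
}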
Notwithstanding, you are updating two variables, namely maxVal and current, which makes it harder to resolve those dependencies with the reduction clause alone. Nonetheless, one approach is to create a data structure shared among the threads, where each thread updates its own position of that shared structure. At the end of the parallel region, the master thread updates the original values of maxVal and current accordingly.
So instead of:
#pragma omp parallel for num_threads(tc) ordered schedule(dynamic, 1) private(i) shared(openSet, maxVal, current, fScores)
for(i = 0; i < openSet.size(); i++){
    if(fScores[openSet[i].x * dim + openSet[i].y] < maxVal){ // <-- you meant '>' not '<'
        #pragma omp ordered
        maxVal = fScores[openSet[i].x * dim + openSet[i].y];
        current = openSet[i];
    }
}
you could try the following:
int shared_maxVal[tc];              // one slot per thread
int shared_current[tc];             // 'int' is a placeholder: use the element type of openSet here
for(int i = 0; i < tc; i++){
    shared_maxVal[i] = INT32_MIN;   // we search a maximum, so start as low as possible
    shared_current[i] = 0;
}
#pragma omp parallel num_threads(tc)
{
    int threadID = omp_get_thread_num();
    #pragma omp for // openSet and fScores are shared by default; 'shared' is not a valid clause on 'omp for'
    for(int i = 0; i < openSet.size(); i++){
        if(fScores[openSet[i].x * dim + openSet[i].y] > shared_maxVal[threadID]){
            shared_maxVal[threadID] = fScores[openSet[i].x * dim + openSet[i].y];
            shared_current[threadID] = openSet[i];
        }
    }
}
for(int i = 0; i < tc; i++){
    if(maxVal < shared_maxVal[i]){
        maxVal = shared_maxVal[i];
        current = shared_current[i];
    }
}
For your second loop:
#pragma omp parallel for num_threads(tc) ordered schedule(dynamic, 1) private(i) shared(neighbours, openSet, gScores, fScores, tentative_gScore)
for(i = 0; i < neighbours.size(); i++){
    #pragma omp ordered
    tentative_gScore = gScores[current.x * dim + current.y] + 1;
    if(tentative_gScore < gScores[neighbours[i].x * dim + neighbours[i].y]){
        cameFrom[neighbours[i].x * dim + neighbours[i].y] = current;
        gScores[neighbours[i].x * dim + neighbours[i].y] = tentative_gScore;
        fScores[neighbours[i].x * dim + neighbours[i].y] = tentative_gScore + hScore(); //(p.x, p.y, xEnd, yEnd)
        if(contains(openSet, neighbours[i]) == false){
            openSet.push_back(neighbours[i]);
        }
    }
}
Some of the aforementioned advice still holds. Moreover, do not make the variable tentative_gScore shared among the threads; otherwise, you need to guarantee mutual exclusion on the accesses to that variable. As it is, your code has a race condition: threads may update tentative_gScore while other threads are reading it. Simply declare tentative_gScore inside the loop so that it is private to each thread.
Assuming that different threads cannot access the same positions of the arrays cameFrom, gScores and fScores, the next thing you need to do is to create an array of openSets and assign each position of that array to a different thread. In this manner, threads can update their respective positions without needing any synchronization mechanism. At the end of the parallel region, merge the per-thread structures back into the original openSet.
Your second loop might look like the following:
// Create an array of "openSets"; let us name it "shared_openSet"
#pragma omp parallel num_threads(tc)
{
    int threadID = omp_get_thread_num();
    #pragma omp for // neighbours, gScores and fScores are shared by default here
    for(int i = 0; i < neighbours.size(); i++){
        // I just assume the type int, but you can change it to the real type
        int tentative_gScore = gScores[current.x * dim + current.y] + 1;
        if(tentative_gScore < gScores[neighbours[i].x * dim + neighbours[i].y]){
            cameFrom[neighbours[i].x * dim + neighbours[i].y] = current;
            gScores[neighbours[i].x * dim + neighbours[i].y] = tentative_gScore;
            fScores[neighbours[i].x * dim + neighbours[i].y] = tentative_gScore + hScore();
            if(contains(openSet, neighbours[i]) == false){
                shared_openSet[threadID].push_back(neighbours[i]);
            }
        }
    }
}
// merge all the elements from shared_openSet into openSet.
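A minimal sketch of that merge, assuming openSet and the entries of shared_openSet are std::vectors of the same element type:
for(int t = 0; t < tc; t++){
    // append each thread's private list to the original open set
    openSet.insert(openSet.end(), shared_openSet[t].begin(), shared_openSet[t].end());
}
If the same neighbour may end up in the lists of several threads, you would also need to de-duplicate here.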