It seems to me that you misunderstood what the OpenMP ordered clause actually does. From the OpenMP Standard one can read:
The ordered construct either specifies a structured block in a
worksharing-loop, simd, or worksharing-loop SIMD region that will be
executed in the order of the loop iterations, or it is a stand-alone
directive that specifies cross-iteration dependences in a doacross
loop nest. The ordered construct sequentializes and orders the
execution of ordered regions while allowing code outside the region to
run in parallel.
or more informally:
The ordered clause works like this: different threads execute
concurrently until they encounter the ordered region, which is then
executed sequentially in the same order as it would get executed in a
serial loop.
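To make the difference concrete, here is a minimal sketch (expensive_work and n are just placeholders, not from your code): the work before the ordered region runs concurrently, while the ordered region itself is executed one iteration at a time, in loop order:
#pragma omp parallel for ordered schedule(dynamic, 1)
for(int i = 0; i < n; i++){
    int value = expensive_work(i);              // runs concurrently across threads
    #pragma omp ordered
    printf("iteration %d -> %d\n", i, value);   // executed sequentially, in iteration order
}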
Based on the way you have used it, it seems that you have mistaken the ordered clause for the OpenMP critical construct:
The critical construct restricts execution of the associated
structured block to a single thread at a time.
Therefore, with the ordered clause your code is essentially running sequentially, with the added overhead of the parallelism. However, even if you had used the critical construct instead, the overhead would still be too high, since threads would be locking on every loop iteration.
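For comparison, a minimal sketch of a critical region (compute and n are placeholders again); the block runs one thread at a time, but in no particular order, and the lock is acquired on every single iteration:
int total = 0;
#pragma omp parallel for
for(int i = 0; i < n; i++){
    int partial = compute(i);   // runs concurrently
    #pragma omp critical
    total += partial;           // one thread at a time, in arbitrary order
}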
At first glance, for the first loop you could use the OpenMP reduction clause (i.e., reduction(max : maxVal)), about which one can read in the standard:
The reduction clause can be used to perform some forms of recurrence
calculations (...) in parallel. For parallel and work-sharing
constructs, a private copy of each list item is created, one for each
implicit task, as if the private clause had been used. (...) The
private copy is then initialized as specified above. At the end of the
region for which the reduction clause was specified, the original list
item is updated by combining its original value with the final value
of each of the private copies, using the combiner of the specified
reduction-identifier.
For a more detailed explanation of how the reduction clause works, have a look at this SO thread.
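For the maximum alone, a reduction-based version could look like the sketch below (assuming maxVal is an int): each thread works on a private copy of maxVal, and the private copies are combined with max at the end of the loop:
int maxVal = INT32_MIN;
#pragma omp parallel for reduction(max : maxVal)
for(int i = 0; i < openSet.size(); i++){
    if(fScores[openSet[i].x * dim + openSet[i].y] > maxVal)
        maxVal = fScores[openSet[i].x * dim + openSet[i].y];
}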
Notwithstanding, you are updating two variables, namely maxVal and current, which makes it harder to resolve those dependencies with the reduction clause alone. Nonetheless, one approach is to create a data structure shared among the threads, where each thread updates its own position of that shared structure. At the end of the parallel region, the master thread updates the original values of maxVal and current accordingly.
So instead of:
#pragma omp parallel for num_threads(tc) ordered schedule(dynamic, 1) private(i) shared(openSet, maxVal, current, fScores)
for(i = 0; i < openSet.size(); i++){
    if(fScores[openSet[i].x * dim + openSet[i].y] < maxVal){ // <-- you meant '>' not '<'
        #pragma omp ordered
        maxVal = fScores[openSet[i].x * dim + openSet[i].y];
        current = openSet[i];
    }
}
you could try the following:
int shared_maxVal[tc];              // one slot per thread
int shared_current[tc];             // 'int' is a placeholder: use the element type of openSet here
for(int i = 0; i < tc; i++){
    shared_maxVal[i] = INT32_MIN;   // we search a maximum, so start as low as possible
    shared_current[i] = 0;
}
#pragma omp parallel num_threads(tc)
{
    int threadID = omp_get_thread_num();
    #pragma omp for // openSet and fScores are shared by default; 'shared' is not a valid clause on 'omp for'
    for(int i = 0; i < openSet.size(); i++){
        if(fScores[openSet[i].x * dim + openSet[i].y] > shared_maxVal[threadID]){
            shared_maxVal[threadID] = fScores[openSet[i].x * dim + openSet[i].y];
            shared_current[threadID] = openSet[i];
        }
    }
}
for(int i = 0; i < tc; i++){
    if(maxVal < shared_maxVal[i]){
        maxVal = shared_maxVal[i];
        current = shared_current[i];
    }
}
For your second loop:
#pragma omp parallel for num_threads(tc) ordered schedule(dynamic, 1) private(i) shared(neighbours, openSet, gScores, fScores, tentative_gScore)
for(i = 0; i < neighbours.size(); i++){
    #pragma omp ordered
    tentative_gScore = gScores[current.x * dim + current.y] + 1;
    if(tentative_gScore < gScores[neighbours[i].x * dim + neighbours[i].y]){
        cameFrom[neighbours[i].x * dim + neighbours[i].y] = current;
        gScores[neighbours[i].x * dim + neighbours[i].y] = tentative_gScore;
        fScores[neighbours[i].x * dim + neighbours[i].y] = tentative_gScore + hScore(); //(p.x, p.y, xEnd, yEnd)
        if(contains(openSet, neighbours[i]) == false){
            openSet.push_back(neighbours[i]);
        }
    }
}
Some of the aforementioned advice still holds. Moreover, do not make the variable tentative_gScore shared among the threads; otherwise, you need to guarantee mutual exclusion on the accesses to that variable. As it is, your code has a race condition: threads may update tentative_gScore while other threads are reading it. Simply declare tentative_gScore inside the loop so that it is private to each thread.
Assuming that different threads cannot access the same positions of the arrays cameFrom, gScores and fScores, the next thing you need to do is to create an array of openSets and assign each position of that array to a different thread. In this manner, threads can update their respective positions without needing any synchronization mechanism. At the end of the parallel region, merge the per-thread structures back into the original openSet.
Your second loop might look like the following:
// Create an array of "openSets"; let us name it "shared_openSet"
#pragma omp parallel num_threads(tc)
{
    int threadID = omp_get_thread_num();
    #pragma omp for // neighbours, gScores and fScores are shared by default here
    for(int i = 0; i < neighbours.size(); i++){
        // I just assume the type int, but you can change it to the real type
        int tentative_gScore = gScores[current.x * dim + current.y] + 1;
        if(tentative_gScore < gScores[neighbours[i].x * dim + neighbours[i].y]){
            cameFrom[neighbours[i].x * dim + neighbours[i].y] = current;
            gScores[neighbours[i].x * dim + neighbours[i].y] = tentative_gScore;
            fScores[neighbours[i].x * dim + neighbours[i].y] = tentative_gScore + hScore();
            if(contains(openSet, neighbours[i]) == false){
                shared_openSet[threadID].push_back(neighbours[i]);
            }
        }
    }
}
// merge all the elements from shared_openSet into openSet.
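A minimal sketch of that merge, assuming openSet and the entries of shared_openSet are std::vectors of the same element type:
for(int t = 0; t < tc; t++){
    // append each thread's private list to the original open set
    openSet.insert(openSet.end(), shared_openSet[t].begin(), shared_openSet[t].end());
}
If the same neighbour may end up in the lists of several threads, you would also need to de-duplicate here.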