Given the following code...
for (size_t i = 0; i < clusters.size(); ++i)
{
const std::set<int>& cluster = clusters[i];
// ... expensive calculations ...
for (int j : cluster)
velocity[j] += f(j);
}
...which I would like to run on multiple CPUs/cores. The function f
does not use velocity
.
A simple #pragma omp parallel for
before the first for loop will produce unpredictable/wrong results, because the std::vector<T> velocity
is modified in the inner loop. Multiple threads may access and (try to) modify the same element of velocity
at the same time.
I think the first solution would be to write #pragma omp atomic
before the velocity[j] += f(j);
operation. This gives me a compile error (might have something to do with the elements being of type Eigen::Vector3d
or velocity
being a class member). Also, I read atomic operations are very slow compared to having a private variable for each thread and doing a reduction in the end. So that's what I would like to do, I think.
I have come up with this:
#pragma omp parallel
{
// these variables are local to each thread
std::vector<Eigen::Vector3d> velocity_local(velocity.size());
std::fill(velocity_local.begin(), velocity_local.end(), Eigen::Vector3d(0,0,0));
#pragma omp for
for (size_t i = 0; i < clusters.size(); ++i)
{
const std::set<int>& cluster = clusters[i];
// ... expensive calculations ...
for (int j : cluster)
velocity_local[j] += f(j); // save results from the previous calculations
}
// now each thread can save its results to the global variable
#pragma omp critical
{
for (size_t i = 0; i < velocity_local.size(); ++i)
velocity[i] += velocity_local[i];
}
}
Is this a good solution? Is it the best solution? (Is it even correct?)
Further thoughts: Using the reduce
clause (instead of the critical
section) throws a compiler error. I think this is because velocity
is a class member.
I have tried to find a question with a similar problem, and this question looks like it's almost the same. But I think my case might differ because the last step includes a for
loop. Also the question whether this is the best approach still holds.
Edit: As request per comment: The reduction
clause...
#pragma omp parallel reduction(+:velocity)
for (omp_int i = 0; i < velocity_local.size(); ++i)
velocity[i] += velocity_local[i];
...throws the following error:
error C3028: 'ShapeMatching::velocity' : only a variable or static data member can be used in a data-sharing clause
(similar error with g++
)