
Given the following code...

for (size_t i = 0; i < clusters.size(); ++i)
{
    const std::set<int>& cluster = clusters[i];
    // ... expensive calculations ...
    for (int j : cluster)
        velocity[j] += f(j); 
} 

...which I would like to run on multiple CPUs/cores. The function f does not use velocity.

A simple #pragma omp parallel for before the first for loop will produce unpredictable/wrong results, because the std::vector<Eigen::Vector3d> velocity is modified in the inner loop and different clusters may contain the same index j. Multiple threads may therefore access and (try to) modify the same element of velocity at the same time.

I think the first solution would be to write #pragma omp atomic before the velocity[j] += f(j); operation. However, this gives me a compile error (it might have something to do with the elements being of type Eigen::Vector3d, or with velocity being a class member). Also, I have read that atomic operations are very slow compared to having a private variable for each thread and doing a reduction at the end. So that is what I would like to do, I think.
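(For comparison, wrapping the update in a critical section instead of atomic should compile, since critical accepts any statement, but it serializes every single update; a minimal sketch:)

#pragma omp parallel for
for (size_t i = 0; i < clusters.size(); ++i)
{
    const std::set<int>& cluster = clusters[i];
    // ... expensive calculations ...
    for (int j : cluster)
    {
        #pragma omp critical
        velocity[j] += f(j); // correct, but every update is serialized
    }
}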

I have come up with this:

#pragma omp parallel
{
    // these variables are local to each thread
    std::vector<Eigen::Vector3d> velocity_local(velocity.size());
    std::fill(velocity_local.begin(), velocity_local.end(), Eigen::Vector3d(0,0,0));

    #pragma omp for
    for (size_t i = 0; i < clusters.size(); ++i)
    {
        const std::set<int>& cluster = clusters[i];
        // ... expensive calculations ...
        for (int j : cluster)
            velocity_local[j] += f(j); // save results from the previous calculations
    } 

    // now each thread can save its results to the global variable
    #pragma omp critical
    {
        for (size_t i = 0; i < velocity_local.size(); ++i)
            velocity[i] += velocity_local[i];
    }
} 

Is this a good solution? Is it the best solution? (Is it even correct?)

Further thoughts: Using the reduction clause (instead of the critical section) throws a compiler error. I think this is because velocity is a class member.

I have tried to find a question with a similar problem, and this question looks like it is almost the same. But I think my case might differ because the last step includes a for loop. Also, the question of whether this is the best approach still stands.

Edit: As requested in a comment: the reduction clause...

    #pragma omp parallel reduction(+:velocity)
    for (omp_int i = 0; i < velocity_local.size(); ++i)
        velocity[i] += velocity_local[i];

...throws the following error:

error C3028: 'ShapeMatching::velocity' : only a variable or static data member can be used in a data-sharing clause

(similar error with g++)
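
(For completeness: OpenMP 4.0 adds user-defined reductions, which could in principle express this directly over a plain local vector and so avoid both the class-member restriction and the critical section. This needs g++ >= 4.9 and is not available with MSVC's OpenMP 2.0 support. A rough, untested sketch; vec3_plus is just a name I made up, and velocity is assumed to be a std::vector<Eigen::Vector3d>:)

// needs <algorithm> and <functional> for std::transform / std::plus
#pragma omp declare reduction(vec3_plus : std::vector<Eigen::Vector3d> : \
    std::transform(omp_out.begin(), omp_out.end(), omp_in.begin(), omp_out.begin(), std::plus<Eigen::Vector3d>())) \
    initializer(omp_priv = std::vector<Eigen::Vector3d>(omp_orig.size(), Eigen::Vector3d::Zero()))

std::vector<Eigen::Vector3d> vel(velocity.size(), Eigen::Vector3d::Zero()); // local variable, so no C3028

#pragma omp parallel for reduction(vec3_plus : vel)
for (size_t i = 0; i < clusters.size(); ++i)
{
    const std::set<int>& cluster = clusters[i];
    // ... expensive calculations ...
    for (int j : cluster)
        vel[j] += f(j);
}

for (size_t i = 0; i < velocity.size(); ++i)
    velocity[i] += vel[i];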

Comments:
  • Share the code using reduce that errors so fixes can be suggested. – Jeff Hammond Jun 19 '15 at 17:05
  • @Jeff done. [enough characters] – Micha Jun 19 '15 at 17:59
  • Have you considered ppl? The code to write "self reducing data" is slick there, and doesn't have to be primitives. Basically you describe what the thread-load data is, and how to combine two thread-local datas, and it does the rest. – Yakk - Adam Nevraumont Jun 19 '15 at 18:08
  • OpenMP doesn't know how to reduce over an STL container. I can't remember if simple arrays are supported. – Jeff Hammond Jun 21 '15 at 01:02
  • @Yakk what is "ppl"? @Jeff Seems like OMP cannot reduce over class members (only local variables) and OMP also cannot reduce over arrays. Just scalar values (like `int`) are allowed for a reduce. – Micha Jun 21 '15 at 11:57
  • parallel patterns library. Your error indicates you are compiling using MSVC most likely. – Yakk - Adam Nevraumont Jun 21 '15 at 12:15
  • @Yakk I need this project to compile on both Windows (vc++) and Linux (g++) – Micha Jun 21 '15 at 13:53
  • Parallelize the inner loop over `cluster`. – Z boson Jun 23 '15 at 07:18
  • @Zboson How would that help? Please read my question again to see why parallelizing over this loop is not possible (tl;dr: write access to the same variable). Also, this loop is not very expensive compared to the outer loop. – Micha Jun 23 '15 at 11:41
  • 1
    When you do the reduction yourself you need to change `velocity[j] += f(j);` to `velocity_local[j] += f(j); – Z boson Jun 23 '15 at 11:51
  • The only reason parallelizing the inner loop may not be possible is if the container (`std::set`) does not have random access. You also wrote `// ... expensive calculations ...` I guess you mean these are calculations you removed and you're not referring to the inner loop? – Z boson Jun 23 '15 at 11:54
  • You can do this with or without a critical section: [Fill histograms (array reduction) in parallel with OpenMP without using a critical section](http://stackoverflow.com/questions/16789242/fill-histograms-array-reduction-in-parallel-with-openmp-without-using-a-critic). – Z boson Jun 23 '15 at 11:57
  • @Zboson Different threads may try to write to the same `velocity[i]` memory address. How would it be possible to account for this without any `critical`, `single` or `atomic` section? Please elaborate on this, you may also post an answer, which might be more suitable for further discussion. As for your last question: "expensive calculations" do most of the computational work. They are referred to in the loop over `cluster`s, as the results of the "expensive calculations" are written to the `velocity` variable in this loop. – Micha Jun 23 '15 at 12:21
  • You were right about writing `velocity_local[j] += f(j);` in the second block of code. Thanks, I fixed it in the question. – Micha Jun 23 '15 at 12:24
  • I forgot to add that you can add `nowait`. I mean you can do `#pragma omp for nowait` in your first loop. – Z boson Jun 26 '15 at 12:32
  • @Zboson thanks, but that does not seem to affect performance. Isn't there an implicit barrier before the critical section anyway? – Micha Jul 02 '15 at 08:57
  • @Micha, `nowait` removes the implicit barrier (that's the whole point of `nowait`). The barrier is unnecessary in this case even if it does not affect performance in your case. At the very least it helps you understand what is being done. – Z boson Jul 02 '15 at 09:05

1 Answer


You're doing an array reduction. I have described this several times (e.g. in Reducing an array in OpenMP and Fill histograms (array reduction) in parallel with OpenMP without using a critical section). You can do this with and without a critical section.

You have already done this correctly with a critical section (in your recent edit) so let me describe how to do this without a critical section.


std::vector<Eigen::Vector3d> velocitya;
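// shared scratch array: thread t owns the contiguous slice
// [t*vsize, (t+1)*vsize) as its private copy of velocity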
#pragma omp parallel
{
    const int nthreads = omp_get_num_threads();
    const int ithread = omp_get_thread_num();
    const int vsize = velocity.size();

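    // only one thread resizes the shared array; the implicit barrier at the
    // end of the single construct guarantees the storage exists before every
    // thread zero-fills its own slice below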
    #pragma omp single
    velocitya.resize(vsize*nthreads);
    std::fill(velocitya.begin()+vsize*ithread, velocitya.begin()+vsize*(ithread+1), 
              Eigen::Vector3d(0,0,0));

    #pragma omp for schedule(static)
    for (size_t i = 0; i < clusters.size(); i++) {
        const std::set<int>& cluster = clusters[i];
        // ... expensive calculations ...
        for (int j : cluster) velocitya[ithread*vsize+j] += f(j);
    } 

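    // second pass: reduce the per-thread copies into velocity; each thread
    // sums a disjoint range of elements i, and the implicit barrier of the
    // loop above guarantees all partial results are complete, so no critical
    // section is needed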
    #pragma omp for schedule(static)
    for(int i=0; i<vsize; i++) {
        for(int t=0; t<nthreads; t++) {
            velocity[i] += velocitya[vsize*t + i];
        }
    }
}

This method requires extra care/tuning because of false sharing, which I have not done here.
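One possible mitigation, sketched below and untested, is to pad each thread's slice so that the used parts of neighbouring slices are at least one full cache line apart (64 bytes assumed), so no cache line can hold data written by two different threads. Only the allocation and the two indexing expressions change:

// assumed 64-byte cache line; Eigen::Vector3d is 24 bytes
const int cacheline = 64;
const int pad   = (int)((cacheline + sizeof(Eigen::Vector3d) - 1) / sizeof(Eigen::Vector3d));
const int vpad  = vsize + pad;   // padded stride between per-thread slices

#pragma omp single
velocitya.resize(vpad*nthreads);

// then index with the padded stride:
//     velocitya[ithread*vpad + j] += f(j);      // inner loop
//     velocity[i] += velocitya[vpad*t + i];     // final reduction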

As to which method is better you will have to test.
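A rough way to test is to time each variant with omp_get_wtime(), the portable OpenMP wall-clock timer; a minimal fragment:

double t0 = omp_get_wtime();
// ... run one of the two variants above ...
double t1 = omp_get_wtime();
printf("elapsed: %f s\n", t1 - t0);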
