
Assume I have a function f(i) which depends on an index i (among other values which cannot be precomputed). I want to fill an array a so that a[n] = f(0) + f(1) + ... + f(n).

Edit: After a comment by Hristo Iliev I realized what I am doing is a cumulative/prefix sum.

This can be written in code as

float sum = 0;
for(int i=0; i<N; i++) {
    sum += f(i);
    a[i] = sum;
}

Now I want to use OpenMP to do this in parallel. One way I could do this with OpenMP is to write out the values of f(i) in parallel and then resolve the dependency in serial. If f(i) is a slow function then this can work well, since the remaining serial loop is simple.

#pragma omp parallel for
for(int i=0; i<N; i++) {
    a[i] = f(i);
}
for(int i=1; i<N; i++) {
    a[i] += a[i-1];
}

But it's possible to do this with OpenMP without the serial loop. The solution I have come up with, however, is complicated and perhaps hackish. So my question is: is there a simpler, less convoluted way to do this with OpenMP?

The code below basically runs the first code I listed for each thread. The result is that values of a in a given thread are correct up to a constant. I save the sum for each thread to an array suma with nthreads+1 elements. This allows me to communicate between threads and determine the constant offset for each thread. Then I correct the values of a[i] with the offset.

float *suma;
#pragma omp parallel
{
    const int ithread = omp_get_thread_num();
    const int nthreads = omp_get_num_threads();
    const int start = ithread*N/nthreads;
    const int finish = (ithread+1)*N/nthreads;
    #pragma omp single
    {
        suma = new float[nthreads+1];
        suma[0] = 0;
    }
    float sum = 0;
    for (int i=start; i<finish; i++) {
        sum += f(i);
        a[i] = sum;
    }
    suma[ithread+1] = sum;
    #pragma omp barrier
    float offset = 0;
    for(int i=0; i<(ithread+1); i++) {
        offset += suma[i];
    }
    for(int i=start; i<finish; i++) {
        a[i] += offset;
    }
}
delete[] suma;

A simple test is just to set f(i) = i. Then the solution is a[i] = i*(i+1)/2 (and at infinity it's -1/12).

Z boson
  • This is pretty much how prefix sums are usually calculated with OpenMP. Instead of manually computing start and finish indexes you could apply `#pragma omp for schedule(static)` to both loops that run over `a[]`. – Hristo Iliev Sep 10 '13 at 16:10
  • @HristoIliev, I thought that although in practice OpenMP defines start and finish like I did, I should not assume OpenMP will do it that way (I thought I read that in one of your posts). The code `for(int i=0; i<(ithread+1); i++)` requires that, in parallel loops, larger index values always correspond to larger thread numbers. Is this true in general? – Z boson Sep 10 '13 at 17:06
  • `schedule(static)` has special properties guaranteed by the standard like repeatable distribution pattern under certain conditions (that are met in your case). – Hristo Iliev Sep 10 '13 at 19:57
  • Okay, I think I understand. I made a SO question about it since I thought it's something others might want to know. I have been unsure about it for a while. – Z boson Sep 11 '13 at 16:18

2 Answers


You can extend your strategy to an arbitrary number of sub-regions, and reduce them recursively, using tasks:

#include<vector>
#include<iostream>

using namespace std;

const int n          = 10000;
const int baseLength = 100;

int f(int ii) {
  return ii;
}

int recursiveSumBody(int * begin, int * end){

  size_t length  = end - begin;
  size_t mid     = length/2;
  int    sum     = 0;


  if ( length < baseLength ) {
    for(size_t ii = 1; ii < length; ii++ ){
        begin[ii] += begin[ii-1];
    }
  } else {
#pragma omp task shared(sum)
    {
      sum = recursiveSumBody(begin    ,begin+mid);
    }
#pragma omp task
    {
      recursiveSumBody(begin+mid, end);
    }
#pragma omp taskwait

#pragma omp parallel for
    for(size_t ii = mid; ii < length; ii++) {
      begin[ii] += sum;
    }

  }
  return begin[length-1];
}

void recursiveSum(int * begin, int * end){

#pragma omp single
  {
    recursiveSumBody(begin,end);
  }    
}


int main() {

  vector<int> a(n,0);

#pragma omp parallel
  {
    #pragma omp for
    for(int ii=0; ii < n; ii++) {          
      a[ii] = f(ii);
    }  

    recursiveSum(a.data(), a.data() + n);  // a.data()+n: one-past-the-end without indexing a[n]

  }
  cout << n*(n-1)/2 << endl;
  cout << a[n-1] << endl;

  return 0;
}
Massimiliano
  • Thank you so much for posting a working example! I guess I was hoping for an answer that works with OpenMP 2.0 (so that it works in MSVC as well) but this is a good chance for me to use OpenMP tasks. I had to increase the `baseLength` to get the correct values for `n=10000`. Do you have any idea how efficient this method is? – Z boson Sep 10 '13 at 17:16
  • Well, I don't think that for this particular example tasks will be faster than the code you have written. What concerns me more is the fact you had to increase `baseLength` to get the correct value, which means there is a flaw somewhere. Anyhow I am not able to see a data race in the program. – Massimiliano Sep 10 '13 at 17:37
  • Well it appears that `baseLength` has to be equal to `n` to get the correct result. – Z boson Sep 10 '13 at 17:52
  • I am getting right results for whichever `baseLength` on my machine. Compiled with `g++ 4.8.1`. – Massimiliano Sep 10 '13 at 18:01
  • Strange, I don't know, I had to add an include to compile but that's it. I'm using G++ 4.7.3. – Z boson Sep 10 '13 at 19:32
  • That header was already included in the snippet. Anyhow I tried on [Coliru](http://coliru.stacked-crooked.com/a/88eb9f10ebbd1c5a) and I get the right result even there. – Massimiliano Sep 10 '13 at 20:05
  • I installed it on another system and it runs fine. I don't have access to the other system now. Now let me try and understand the code. Coliru is really cool! – Z boson Sep 11 '13 at 11:53

For completeness' sake, here is the code of the OP's MWE with Hristo's remark taken into account:

#include <iostream>
#include <omp.h>
using std::cout;
using std::endl;

const int N = 10;
const int Nthr = 4;
float f(int i) {return (float)i;}

int main(void) {
    omp_set_num_threads(Nthr);
    float* a = new float[N];
    float *suma = new float[Nthr+1]();  // zero-initialized, so entries of threads that get no iterations stay 0
    float sum = 0.0;
#pragma omp parallel for schedule(static) firstprivate(sum)
    for (int i=0; i<N; i++) {
        sum += f(i);
        a[i] = sum;
        suma[omp_get_thread_num()+1] = sum;
    }

    // this for-loop is also a cumulative sum, but it has only Nthr iterations
    for (int i=1; i<Nthr;i++)
        suma[i] += suma[i-1];

#pragma omp parallel for schedule(static)
    for(int i=0; i< N; i++) {
        a[i] += suma[omp_get_thread_num()];
    }

    for (int i=0; i<N; i++) {
        cout << a[i] << endl;
    }

    delete[] suma;
    int n = N - 1;  // n must stay below N to avoid reading past the array
    cout << a[n] << endl << n*(n+1)/2 << endl;
    delete[] a;
    return 0;
}

Maverick