2

I am implementing the prefix sums problem in OpenMP and I don't seem to get any speedup. Actually, the parallel implementation is taking longer than the sequential one.

Here is my code for the prefix sums:

/* first phase: sum pairs at increasing strides (up-sweep) */
for (k = 1; k < n; k = kk) {
    kk = k << 1;

    #pragma omp parallel for
    for (i = kk - 1; i < n; i += kk) {
        x[i] = x[i-k] + x[i];
    }
}

/* second phase: propagate partial sums back at decreasing strides (down-sweep) */
for (k = k >> 1; k > 1; k = kk) {
    kk = k >> 1;

    #pragma omp parallel for
    for (i = k - 1; i < n - kk; i += k) {
        x[i + kk] = x[i] + x[i + kk];
    }
}
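For comparison, the sequential version I'm timing against is essentially a plain in-place scan (a sketch, not my exact code):

```c
/* Plain in-place inclusive prefix sum: one pass, each element
 * absorbs the running total of everything before it. */
void prefix_sum_serial(int *x, int n) {
    for (int i = 1; i < n; i++)
        x[i] += x[i - 1];
}
```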

I compiled this using gcc -fopenmp -O3 prefix_sums.c. The results that I get for 1 000 000 integers are:

for the sequential implementation (also compiled with -O3):

0.001132
0.000929
0.000872
0.000865
0.000842

for the parallel implementation (5 re-runs on 4 cores):

0.025851
0.005493
0.006327
0.007092
0.030720

Could someone explain to me what the problem could be? The implementation gives the correct output, so why does it take so long?

Thank you.

pixie

2 Answers

6

Prefix sum can be made parallel both for MIMD (e.g. with OpenMP) and for SIMD (e.g. with SSE/AVX).

It's a bit of a pain to do the prefix sum with OpenMP, but it's not too bad. I already went into extensive detail on this in simd-prefix-sum-on-intel-cpu and in parallel-cumulative-prefix-sums-in-openmp-communicating-values-between-thread.

Edit: You're doing the prefix sum in-place (in situ). The links above do it out of place (ex situ). I modified the code (see below) to do the prefix sum in-place as you're doing and tested it. You will probably need more than two physical cores to see any benefit.

Basically you do this in two passes. In the first pass you compute partial sums, and in the second pass you correct each partial sum with a constant offset. The second pass will be vectorized by a good compiler (e.g. GCC, but not MSVC). It's possible to use SIMD in the first pass as well, but no compiler I have used will vectorize it, so you have to do it yourself with intrinsics.

The algorithm is O(n), so it quickly becomes memory bound rather than compute bound. This means that for arrays which fit in the L1 cache the OpenMP overhead is too large; in that case it's better just to use SIMD (which has no threading overhead). For larger arrays you can benefit from both SIMD and MIMD, but at some point the algorithm becomes memory bound and is not much faster than the non-parallel algorithm.

#include <stdio.h>
#include <omp.h>

void prefixsum_inplace(float *x, int N) {
    float *suma;
    #pragma omp parallel
    {
        const int ithread = omp_get_thread_num();
        const int nthreads = omp_get_num_threads();
        #pragma omp single
        {
            suma = new float[nthreads+1];
            suma[0] = 0;
        }
        // pass 1: each thread prefix-sums its own static chunk
        float sum = 0;
        #pragma omp for schedule(static)
        for (int i=0; i<N; i++) {
            sum += x[i];
            x[i] = sum;
        }
        suma[ithread+1] = sum;
        #pragma omp barrier
        // pass 2: add the totals of all preceding chunks as an offset;
        // schedule(static) guarantees the same chunk assignment as pass 1
        float offset = 0;
        for(int i=0; i<(ithread+1); i++) {
            offset += suma[i];
        }
        #pragma omp for schedule(static)
        for (int i=0; i<N; i++) {
            x[i] += offset;
        }
    }
    delete[] suma;
}

int main() {
    const int n = 20;
    float x[n];
    for(int i=0; i<n; i++) x[i] = 1.0*i;
    prefixsum_inplace(x, n);
    for(int i=0; i<n; i++) printf("%f %f\n", x[i], 0.5*i*(i+1));
}
Z boson
0

Since each element depends on the previous element, you have to break the algorithm into two steps. In the first step, each thread calculates the prefix sums on only a subset of the integers (so that no thread depends on any other thread); in the second step, each thread adds in the relevant results of the other threads.

For example: x[3] depends on x[0], x[1], x[2] and x[3]. You could split the calculation into two subsets: let one thread compute x[1] = x[0] + x[1], and let a second thread compute x[3] = x[2] + x[3]. After this step the threads have to synchronize (which OpenMP does for you if you begin a second parallel loop), and the second thread then calculates the final answer by adding x[1] to x[3]. If you have lots of integers and lots of threads, it might even be beneficial to break the calculation down into three steps.

This is basically a parallel reduction, which can be used to parallelize most (?) iterative algorithms. Dr. Dobb's has an article giving the theory, with some images, of what parallel reduction precisely is.
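As a minimal example of the reduction idea: a plain sum is the canonical parallel reduction, and with OpenMP it is a single clause (the function name here is made up for illustration):

```c
#include <omp.h>

/* Each thread accumulates a private partial sum; OpenMP combines
 * the partials with + when the loop ends. */
long sum_reduce(const int *x, int n) {
    long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++)
        total += x[i];
    return total;
}
```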

PS: On closer examination, your implementation of the prefix sum problem is fairly complex. It still has a lot of dependencies (which I did not examine closely), but I think my statements above are still valid and you could use a parallel reduction. I was wondering, though: did you implement an algorithm that was actually intended for building hardware circuits?

Mark