
I have some code that computes the sums of an array's prefixes in parallel (e.g. out_arr[0] = in_arr[0], out_arr[1] = in_arr[0] + in_arr[1], and so on). My code launches N threads, where N is the number of elements of in_arr, and each thread processes only one element of the array. This is not a good solution, so I want each thread to process N/num_of_threads elements instead, but I've failed to get that working.
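
For reference, the plain serial version of what I am trying to parallelize is just a running sum (the function name here is only for illustration):

void CalcSumSerial(int a[], int s[], int n) {
    // s[i] = a[0] + a[1] + ... + a[i]
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += a[i];
        s[i] = sum;
    }
}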

I tried to create a shared variable equal to N/num_of_threads and put a for loop over it right after the first #pragma directive, but all I got was garbage values in stdout that I couldn't debug.

This is the working version of the «bad» solution:

void CalcSum2(int a[], int s[], int n) {
    // naive parallel scan: one thread per array element, log2(n) doubling rounds
    int *old = new int[n], *cnt = new int[n];
    #pragma omp parallel num_threads(n)
    {
        int i = omp_get_thread_num(), d = 1;
        s[i] = a[i];
        cnt[i] = 1;
        #pragma omp barrier
        while (d < n) {
            old[i] = s[i];
            #pragma omp barrier
            if (i >= d) {           // add the partial sum from d positions back
                s[i] += old[i - d];
                cnt[i]++;
            }
            d += d;                 // double the stride each round
            #pragma omp barrier
        }
    }
    delete[] old; delete[] cnt;
}
zerospiel

2 Answers


The way you parallelize the scan uses too many barriers, which can hurt performance.

A parallel scan on a multi-core CPU is not very efficient in any case, because the number of additions grows from n-1 in the serial scan to about 2n. The time cost is therefore about 2n/m, where m is the number of CPU cores, so the best speedup you can hope for over a serial scan is about m/2.

To reduce the number of barriers, you could first do a sequential scan on each segment of the data, then add the proper offset to each segment's result. The code below demonstrates the idea. It got a 2.4x speedup on an 8-core CPU when len is 1G. You could still improve the second part to get higher performance.

#include <omp.h>

inline void scan(int a[], int s[], int len)
{
    // serial inclusive prefix sum of a[0..len-1] into s
    int sum = 0;
    for (int i = 0; i < len; i++) {
        sum += a[i];
        s[i] = sum;
    }
}

void ParallelScan(int a[], int s[], int len)
{
    int nt;
    int seglen, subseglen;
    int* segsum;
    #pragma omp parallel
    {
        #pragma omp single
        {
            nt = omp_get_num_threads();
            seglen = (len + nt - 1) / nt;        // elements per segment, one segment per thread
            subseglen = (seglen + nt - 1) / nt;  // elements per thread when correcting a segment
            segsum = new int[nt];
        }
        // pass 1: each thread scans its own segment sequentially
        int tid = omp_get_thread_num();
        int start = seglen * tid;
        int end = seglen * (tid + 1);
        end = end > len ? len : end;

        scan(&a[start], &s[start], end - start);
        segsum[tid] = s[end - 1];                // total of this segment
        #pragma omp barrier

        // prefix-sum the per-segment totals to get the offset for each segment
        #pragma omp single
        for (int i = 1; i < nt; i++) {
            segsum[i] += segsum[i - 1];
        }

        // pass 2: all threads cooperate in adding the offset to every later segment
        for (int segid = 1; segid < nt; segid++) {
            int segstart = seglen * segid;
            int start = segstart + subseglen * tid;
            int end = start + subseglen;
            end = end > len ? len : end;
            end = end > segstart + seglen ? segstart + seglen : end;

            int offset = segsum[segid - 1];
            for (int i = start; i < end; i++) {
                s[i] += offset;
            }
        }
    }

    delete[] segsum;
}
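
A quick way to sanity-check the result is to compare it against the serial scan. The test driver below is only illustrative (it is not the code used for the timing above) and assumes it is compiled in the same file as scan and ParallelScan:

#include <cstdio>

int main()
{
    const int len = 1 << 20;
    int *a = new int[len], *s = new int[len], *ref = new int[len];
    for (int i = 0; i < len; i++) a[i] = 1;   // prefix sums should be 1, 2, 3, ...
    ParallelScan(a, s, len);
    scan(a, ref, len);                        // serial reference
    int mismatches = 0;
    for (int i = 0; i < len; i++)
        if (s[i] != ref[i]) mismatches++;
    printf("%d mismatches\n", mismatches);
    delete[] a; delete[] s; delete[] ref;
    return 0;
}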
kangshiyin
  • Is that really an 8 core CPU or is it 4 cores and 8 hyper threads? What CPU is it? – Z boson Oct 18 '13 at 11:06
  • @redrum 8 cores. I changed the data type to `int` and got an even worse result. – kangshiyin Oct 18 '13 at 12:04
  • You changed the length to 1G, which is main-memory bound. That's likely the reason for the drop from 3.5x to only 2.5x faster. Since the best is 2n/m and you have 8 cores, the best you can expect is 4 times faster. You got 3.5x originally. That's not bad. – Z boson Oct 18 '13 at 13:10
  • +1 for stating the time cost `2n/m` and for suggesting the 2nd-order correction. – Z boson Oct 21 '13 at 09:13
  • Sorry to belabor this point, but your time-cost formula `2n/m` is only true if the second pass is not vectorized. If it is, then the time cost will be closer to `n/m` but still greater than `n/m`. – Z boson Oct 22 '13 at 06:44
  • That's a rough estimate of the total number of 'plus' operations; of course you can use vectorization to reduce the number of instructions and clock cycles. – kangshiyin Oct 23 '13 at 08:55

You're doing a cumulative sum, also known as a prefix sum. This can be done in parallel with OpenMP. I solved this problem recently; see Parallel cumulative (prefix) sums in OpenMP: communicating values between threads.

You have to run over the array twice in parallel. The first time does partial sums and the second time corrects the partial sums with an offset.

I converted your code to this for you below. As a test I did the cumulative sum of the counting numbers, which has the closed-form solution i*(i+1)/2. You can see that the prefix_sum function gets the right answer.

#include <stdio.h>
#include <omp.h>

void prefix_sum(int a[], int s[], int n) {
    int *suma;
    #pragma omp parallel
    {
        const int ithread = omp_get_thread_num();
        const int nthreads = omp_get_num_threads();
        #pragma omp single
        {
            suma = new int[nthreads+1];
            suma[0] = 0;
        }
        int sum = 0;
        #pragma omp for schedule(static) nowait // do partial sum in parallel
        for(int i=0; i<n; i++) {
            sum += a[i];
            s[i] = sum;
        }
        suma[ithread+1] = sum;
        #pragma omp barrier
        int offset = 0;
        for(int i=0; i<(ithread+1); i++) {
            offset += suma[i];
        }

        #pragma omp for schedule(static) //run over array again in parallel for full sum
        for(int i=0; i<n; i++) {
            s[i] += offset;
        }
    }
    delete[] suma;
}

int main() {
    const int n = 100;
    int *a = new int[n];
    int *s = new int[n];
    for(int i=0; i<n; i++) {
        a[i] = i;
    }
    prefix_sum(a, s, n);
    for(int i=0; i<n; i++) {
        printf("%d ", s[i]);
    } printf("\n");

    for(int i=0; i<n; i++) {
        printf("%d ", i*(i+1)/2);
    } printf("\n");

    delete[] a;
    delete[] s;
}

Edit One of the problems with this method is that for large arrays most of the values have been evicted from the cache by the time the second pass starts. I came up with a solution which runs over a chunk in parallel and then moves on to the next chunk sequentially. I set the chunk_size to the level-2 cache size (actually times four, due to having four cores). This gives a big improvement for larger arrays. Here is an outline of the function. The complete function can be found in my answer at simd-prefix-sum-on-intel-cpu.

void scan_omp_SSEp2_SSEp1_chunk(float a[], float s[], int n) {
    float *suma;
    const int chunk_size = 1<<18;
    const int nchunks = n%chunk_size == 0 ? n / chunk_size : n / chunk_size + 1;    
    #pragma omp parallel
    {
        //initialization code 
        for (int c = 0; c < nchunks; c++) {
            const int start = c*chunk_size;
            const int chunk = (c + 1)*chunk_size < n ? chunk_size : n - c*chunk_size; 
            //pass1: pass1_SSE(&a[start], &s[start], chunk);                
            //get offset
            //pass2: pass2_SSE(&s[start], offset, chunk);
        }
    }
    delete[] suma;
}
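
For illustration, here is a plain scalar sketch of the same chunked two-pass idea, with pass 1 and pass 2 written as ordinary loops instead of the SSE passes (this is only a sketch of the approach described above, not the tuned code from the linked answer; running_total carries the sum of all previous chunks forward):

void scan_omp_chunked(float a[], float s[], int n) {
    float *suma = nullptr;
    float running_total = 0.0f;                   // sum of all chunks processed so far
    const int chunk_size = 1 << 18;
    const int nchunks = n % chunk_size == 0 ? n / chunk_size : n / chunk_size + 1;
    #pragma omp parallel
    {
        const int ithread = omp_get_thread_num();
        const int nthreads = omp_get_num_threads();
        #pragma omp single
        suma = new float[nthreads + 1];
        for (int c = 0; c < nchunks; c++) {
            const int start = c * chunk_size;
            const int chunk = (c + 1) * chunk_size < n ? chunk_size : n - start;
            #pragma omp single
            suma[0] = running_total;              // offset carried in from previous chunks
            // pass 1: per-thread partial sums within this chunk
            float sum = 0.0f;
            #pragma omp for schedule(static) nowait
            for (int i = start; i < start + chunk; i++) {
                sum += a[i];
                s[i] = sum;
            }
            suma[ithread + 1] = sum;
            #pragma omp barrier
            float offset = 0.0f;
            for (int i = 0; i < ithread + 1; i++) offset += suma[i];
            // pass 2: correct this chunk while it is still in cache
            #pragma omp for schedule(static)
            for (int i = start; i < start + chunk; i++) {
                s[i] += offset;
            }
            #pragma omp single
            running_total = s[start + chunk - 1]; // carry the total into the next chunk
        }
    }
    delete[] suma;
}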
Z boson
  • Fewer barriers can reduce the thread overhead, but when `ithread` is 0, `offset` is 0, so you could save `n/nthreads` unnecessary addition operations to improve the performance, especially when nthreads is 1 or 2. – kangshiyin Oct 18 '13 at 10:53
  • That's interesting. Let me think about that. – Z boson Oct 18 '13 at 11:15
  • By doing n/m fewer additions the time cost becomes `2n/m - n/m^2` (where m = nthreads). Compared to `2n/m`, that's a `1/(2m)` decrease in time: for m=2 that's 25%, for m=3 about 16%, for m=4 about 12%. For two-core systems it might be worth worrying about, but even then the improvement is at best 25%, and the extra code would probably wash it out. Parallel prefix sums on a two-core system are probably not worth doing, and I would have to change my code significantly to do this with OpenMP as well. – Z boson Oct 18 '13 at 11:34