You're doing a cumulative sum. Also know as a prefix sum. This can be done in parallel with OpenMP. I solved this problem recently with OpenMP Parallel cumulative (prefix) sums in OpenMP: communicating values between threads
You have to run over the array twice in parallel. The first time does partial sums and the second time corrects the partial sums with an offset.
I converted your code to to this for you below. As as test I did the sum of the counting number which has a closed form solution of i*(i+1)/2
. You can see that the prefix_sum function get's the right answer.
#include <stdio.h>
#include <omp.h>
void prefix_sum(int a[], int s[], int n) {
int *suma;
#pragma omp parallel
{
const int ithread = omp_get_thread_num();
const int nthreads = omp_get_num_threads();
#pragma omp single
{
suma = new int[nthreads+1];
suma[0] = 0;
}
int sum = 0;
#pragma omp for schedule(static) nowait // do partial sum in parallel
for(int i=0; i<n; i++) {
sum += a[i];
s[i] = sum;
}
suma[ithread+1] = sum;
#pragma omp barrier
int offset = 0;
for(int i=0; i<(ithread+1); i++) {
offset += suma[i];
}
#pragma omp for schedule(static) //run over array again in parallel for full sum
for(int i=0; i<n; i++) {
s[i] += offset;
}
}
delete[] suma;
}
int main() {
const int n = 100;
int *a = new int[n];
int *s = new int[n];
for(int i=0; i<n; i++) {
a[i] = i;
}
prefix_sum(a, s, n);
for(int i=0; i<n; i++) {
printf("%d ", s[i]);
} printf("\n");
for(int i=0; i<n; i++) {
printf("%d ", i*(i+1)/2);
} printf("\n");
}
Edit
One of the problems of this method is that for large arrays most of the values have been evicted from the cache by the time the second pass starts. I came up with a solution which runs over a chunk in parallel and then moves on to the next chunk sequentially. I set the chunck_size to the level-2 cache (actually times four due to having four cores). This gives a big improvement for larger arrays. Here is an outline of the function. The complete function can be found in my answer at simd-prefix-sum-on-intel-cpu.
void scan_omp_SSEp2_SSEp1_chunk(float a[], float s[], int n) {
float *suma;
const int chunk_size = 1<<18;
const int nchunks = n%chunk_size == 0 ? n / chunk_size : n / chunk_size + 1;
#pragma omp parallel
{
//initialization code
for (int c = 0; c < nchunks; c++) {
const int start = c*chunk_size;
const int chunk = (c + 1)*chunk_size < n ? chunk_size : n - c*chunk_size;
//pass1: pass1_SSE(&a[start], &s[start], chunk);
//get offset
//pass2: pass2_SSE(&s[start], offset, chunk);
}
}
delete[] suma;
}