I am implementing the prefix sums problem in OpenMP and I don't seem to get any speedup. Actually, the parallel implementation is taking longer than the sequential one.
Here is my code for the prefix sums:
for (k = 1; k < n; k = kk) {
kk = k << 1;
#pragma omp parallel for
for (i = kk - 1; i < n; i += kk) {
x[i] = x[i-k] + x[i];
}
}
for (k = k >> 1; k > 1; k = kk) {
kk = k >> 1;
#pragma omp parallel for
for (i = k - 1; i < n - kk; i += k) {
x[i + kk] = x[i] + x[i + kk];
}
}
I compiled this using gcc -fopenmp -O3 prefix_sums.c. The results that I get for 1 000 000 integers are:
for the sequential implementation (compiled also with -O3):
0.001132
0.000929
0.000872
0.000865
0.000842
for the parallel implementation (5 re-runs on 4 cores):
0.025851
0.005493
0.006327
0.007092
0.030720
Could someone explain me what the problem could be? The implementation is giving the correct output, but why does it take so long?
Thank you.