
I'm trying to parallelize a loop whose iterations depend on each other. I've tried a reduction and the code runs, but the result is wrong. I think the reduction works for the sum but not for the update of the array inside the loop. Is there a way to get the right result while still parallelizing this loop?

#pragma omp parallel for reduction(+: sum)
for (int i = 0; i < DATA_MAG; i++)
{
    sum += H[i];

    LUT[i] = sum * scale_factor;
}
CIVI89
  • You can't use the value you are reducing while you are reducing it, i.e. `LUT[i] = sum * scale_factor;` won't work. – Z boson Apr 19 '16 at 07:20
  • You need to use a prefix-sum method. Unfortunately, the parallel prefix sum is memory bandwidth bound so I don't think it will help you much. – Z boson Apr 19 '16 at 07:21
  • Ok...I suspected it, so there is no way? – CIVI89 Apr 19 '16 at 07:24
  • It can be done but it's not easy and I am not sure it's worth it. What you could do is use a SIMD prefix sum if `DATA_MAG` is not too large. The problem with using OpenMP is that if `DATA_MAG` is too small the OpenMP overhead dominates, but if `DATA_MAG` is too large it's memory bandwidth bound. There may be some sweet spot in between, e.g. between the L2 and L3 cache, where you see some benefit using threads though. What is the range of `DATA_MAG`? – Z boson Apr 19 '16 at 07:27
  • DATA_MAG is 256... anyway, I understand your point. In many cases it's not worth adding OpenMP because of the performance loss. – CIVI89 Apr 19 '16 at 07:31
  • `256` is tiny!!! If this is really critical use [prefix SIMD](https://stackoverflow.com/questions/19494114/parallel-prefix-cumulative-sum-with-sse/19519287#19519287). I don't think you can use OpenMP's `simd` construct for this. You need to use intrinsics or assembly. There is no way OpenMP `simd` or auto-vectorization is going to figure this out. – Z boson Apr 19 '16 at 07:31
  • What is your CPU? What is your compiler? Is this Linux and GCC? What does `cat /proc/cpuinfo` say? – Z boson Apr 19 '16 at 07:35
  • The CPU is an i5 dual-core Haswell, compiler GCC 5 on OS X. – CIVI89 Apr 19 '16 at 07:36

1 Answer


The reduction clause creates private copies of sum for each thread in the team as if the private clause had been used on sum. After the for loop the result of each private copy is combined with the original shared value of sum. Since the shared sum is only updated after the for loop you cannot rely on it inside of the for loop.
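For illustration, here is roughly what the `reduction(+: sum)` version expands to (a simplified sketch, not the exact runtime mechanics), reusing `H`, `LUT`, `DATA_MAG` and `scale_factor` from the question. It shows why `LUT[i]` only ever sees a thread-local partial sum:

float sum = 0.0f;
#pragma omp parallel
{
    float sum_private = 0.0f;                 // private copy, as created by reduction(+: sum)
    #pragma omp for nowait
    for (int i = 0; i < DATA_MAG; i++)
    {
        sum_private += H[i];
        LUT[i] = sum_private * scale_factor;  // wrong: missing the sums of chunks done by other threads
    }
    #pragma omp atomic
    sum += sum_private;                       // private copies are only combined here, after the loop
}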

In this case you need to do a prefix sum. Unfortunately, the parallel prefix sum using threads is memory bandwidth bound for large DATA_MAG and dominated by the OpenMP overhead for small DATA_MAG. However, there may be some sweet spot in between where you see some benefit from using threads. A sketch of how that would look is below.
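For completeness, the threaded version needs two passes over the data. Here is a minimal sketch (my own, under stated assumptions, not part of the original answer) of a two-pass prefix sum with OpenMP; the name `prefix_sum_omp` and the fixed 64-entry `partial` array (i.e. at most 63 threads) are arbitrary choices for the example:

#include <omp.h>

void prefix_sum_omp(const float *a, float *s, int n, float scale_factor) {
    float partial[64] = {0};           // partial[t+1] = total of thread t's chunk (assumes < 64 threads)
    #pragma omp parallel
    {
        int nt = omp_get_num_threads();
        int t  = omp_get_thread_num();
        int lo = n *  t      / nt;     // this thread's chunk is [lo, hi)
        int hi = n * (t + 1) / nt;

        // Pass 1: local inclusive scan of this chunk.
        float sum = 0.0f;
        for (int i = lo; i < hi; i++) {
            sum += a[i];
            s[i] = sum;
        }
        partial[t + 1] = sum;
        #pragma omp barrier

        // Offset contributed by all earlier chunks.
        float offset = 0.0f;
        for (int k = 0; k <= t; k++) offset += partial[k];

        // Pass 2: add the offset and apply the scale factor.
        for (int i = lo; i < hi; i++)
            s[i] = (s[i] + offset) * scale_factor;
    }
}

Note that it reads and writes the data twice, which is why the threaded version ends up memory bandwidth bound for large DATA_MAG.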

But in your case DATA_MAG is only 256, which is very small and would not benefit from OpenMP anyway. What you can do is use SIMD. However, OpenMP's simd construct is not powerful enough for the prefix sum as far as I know. You can, however, do it by hand using intrinsics like this:

#include <stdio.h>
#include <x86intrin.h>

#define N 256

// Inclusive prefix sum of the four floats in x (log-step byte shifts by 4 and 8).
static inline __m128 scan_SSE(__m128 x) {
  x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
  x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
  return x;
}

// Prefix sum of a[0..n) scaled by scale_factor, written to s; n must be a multiple of 4.
void prefix_sum_SSE(float *a, float *s, int n, float scale_factor) {
  __m128 offset = _mm_setzero_ps();
  __m128 f = _mm_set1_ps(scale_factor);
  for (int i = 0; i < n; i+=4) {
    __m128 x = _mm_loadu_ps(&a[i]);
    __m128 out = scan_SSE(x);                                   // scan within this block of 4
    out = _mm_add_ps(out, offset);                              // add running total of previous blocks
    offset = _mm_shuffle_ps(out, out, _MM_SHUFFLE(3, 3, 3, 3)); // broadcast last element as new offset
    out = _mm_mul_ps(out, f);
    _mm_storeu_ps(&s[i], out);
  }
}

int main(void) {
  float H[N], LUT[N];
  for(int i=0; i<N; i++) H[i] = i;
  prefix_sum_SSE(H, LUT, N, 3.14159f);
  for(int i=0; i<N; i++) printf("%.1f ", LUT[i]); puts("");
  for(int i=0; i<N; i++) printf("%.1f ", 3.14159f*i*(i+1)/2); puts("");

}
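Assuming GCC on 64-bit x86 (where SSE2 is the baseline), this should build with just `gcc -O3` and no extra flags; on a 32-bit target you may need to enable SSE explicitly.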

See the prefix-SIMD question linked in the comments above for more details about the SIMD prefix sum with SSE and AVX.

Z boson
  • 32,619
  • 11
  • 123
  • 226