
I have this code:

    #pragma omp declare reduction(* : scalar : omp_out *= omp_in)
    scalar like = 1;
    vector<scalar>& bigsum;   // reference into a larger program; snippet only
    #pragma omp parallel for // reduction(* : like)
    for (int m = 0; m < M - 1; m++)
       like *= bigsum[m];

I am trying to get a consistent result, but I don't (a race condition), so how should I fix it? As you can see in the code, I have declared my own reduction, but it doesn't work either. Is there any trick for `scalar` and `std::vector` that I should be aware of?

The `scalar` type here is just a floating-point wrapper I created that applies log() to each double, because there are so many double-by-double multiplications that the result becomes close to zero after a couple of them. For example, by working with log() values, multiplication becomes addition, and so on.
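
(For reference, a minimal log-space wrapper might look roughly like this; the real `scalar` class is not shown in the question, so all the names here are only assumptions.)

    #include <cmath>

    // Hypothetical sketch of the question's log-space "scalar": it stores
    // log(x), so multiplying many small doubles becomes adding their logs
    // and the product no longer underflows toward zero.
    struct scalar {
        double lg;                                   // holds log(value)
        scalar(double v = 1.0) : lg(std::log(v)) {}  // scalar(1) is the identity for *
        scalar& operator*=(const scalar& o) { lg += o.lg; return *this; }
    };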

One way to get a consistent result would be this:

    #pragma omp parallel
    {
       scalar loc = 1;   // per-thread local (declared but not used in this version)
    #pragma omp for
       for (std::size_t m = 1; m < M; m++)
       {
          _flm[m-1] = Qratio[m-1] * k1 + k2;
          bigsum[m-1] = kappa0omegaproduct + kappa[1] * _flm[m-1];
    #pragma omp critical (reduce_product)
          {
             like *= bigsum[m-1];   // the lock is taken on every single iteration
          }
       }
    }

This version is correct, but it is very slow: almost 8 times slower on my 8-core machine!

M.Rez
  • What type is `scalar`? What kind of race condition do you observe? – Zulan Jul 17 '16 at 16:29
  • I am dealing with a very large number of doubles. When I multiply many of them the result becomes too close to zero, so I am using this scalar type, which overloads the operations on the logs of those small doubles. – M.Rez Jul 17 '16 at 17:04
  • Anyway, please provide a [mcve]. – Zulan Jul 17 '16 at 18:23
  • Do you not see a conflict between your desire, on the one hand, to ensure that every operation happens in a precisely determined order, and, on the other, to allow multiple operations to happen in parallel (so without any specification of the order between them)? – Jim Cownie Jul 18 '16 at 14:23
  • I updated my answer according to the comments. Thanks for the great comments. – M.Rez Jul 18 '16 at 16:41
  • The issue is probably not in the code you show. Seriously, you are a long-time, experienced SO member; why are you so reluctant to provide a [mcve]? – Zulan Jul 18 '16 at 17:01

2 Answers


After three days I have an answer myself, and an explanation of what I found.

I created my own user-defined reduction like this:

    #pragma omp declare reduction(* : scalar : omp_out *= omp_in) initializer(omp_priv(1))

The trick was `omp_priv`: apparently the initialization of the reduction's private copies is important, something I learned here.

I made the code much simpler by applying the OpenMP for loop like this:

    #pragma omp parallel for reduction(* : like)

Very simple and clean. With this change the code is parallelized and runs faster than what I had in the question body. Unfortunately, it is still slower than the serial version, maybe because of the std::vector usage, or because the overloaded arithmetic operations are very slow. I do not know. The code is too large to copy and paste here in a way that would be understandable and not a pain to read for others.
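
Putting the pieces together, a minimal sketch of this fix could look like the following. It assumes the hypothetical log-space `scalar` from the sketch in the question; `product` and the loop bounds are illustrative, not taken from the original code.

    #include <cmath>
    #include <vector>

    // Same hypothetical log-space scalar as sketched in the question.
    struct scalar {
        double lg;
        scalar(double v = 1.0) : lg(std::log(v)) {}
        scalar& operator*=(const scalar& o) { lg += o.lg; return *this; }
    };

    // User-defined reduction with an explicit identity for the private copies.
    #pragma omp declare reduction(* : scalar : omp_out *= omp_in) \
        initializer(omp_priv = scalar(1))

    // Illustrative helper: reduce the whole vector with the declared reduction.
    scalar product(const std::vector<scalar>& bigsum)
    {
        scalar like = 1;
        #pragma omp parallel for reduction(* : like)
        for (long m = 0; m < (long)bigsum.size(); m++)
            like *= bigsum[m];   // each thread multiplies into its private copy
        return like;
    }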

macrocosme
M.Rez

Your initial example relies on the native OpenMP reduction. Are you sure that the user-defined OpenMP reduction works correctly? It might be an issue with the overloaded operator.

Your second example is correct (as you said) but very slow because of the critical section inside the loop. Why don't you implement the reduction manually, by having a local variable (e.g. `local_like`) per thread and then using a critical section after the `omp for` loop, as in the sketch below?
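
(A minimal sketch of that manual reduction, again assuming the hypothetical log-space `scalar` sketched earlier; `product_manual` is just an illustrative name.)

    #include <cmath>
    #include <vector>

    struct scalar {                                  // same log-space sketch as above
        double lg;
        scalar(double v = 1.0) : lg(std::log(v)) {}
        scalar& operator*=(const scalar& o) { lg += o.lg; return *this; }
    };

    scalar product_manual(const std::vector<scalar>& bigsum)
    {
        scalar like = 1;
        #pragma omp parallel
        {
            scalar local_like = 1;                   // per-thread partial product
            #pragma omp for nowait
            for (long m = 0; m < (long)bigsum.size(); m++)
                local_like *= bigsum[m];             // no locking inside the loop
            #pragma omp critical(reduce_product)
            like *= local_like;                      // one combine per thread
        }
        return like;
    }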

phadjido
  • Yes, I should update my reduction function. First I have to debug to see why some of the overloaded functions are not being called! Thanks for mentioning it. – M.Rez Jul 19 '16 at 10:12
  • When I did it with local variables I couldn't see the needed speed-up. If you check my own answer, I have done the reduction and it is almost 10 times faster than the code I shared in the question on my 8-core machine. – M.Rez Jul 20 '16 at 09:07