
code that may not be directly parallelized by OpenMP

I am working on an autoparallelizer and am looking for benchmarks that may not be directly parallelized by OpenMP ( "directly parallelized" meaning: a parallel executable can be created without modifying the code, just by specifying the proper OpenMP directives ).

While on the subject, is it possible to parallelize the following code using OpenMP?

1.

double a[ARRAY_DIM], c[ARRAY_DIM];
  .
double ret;
ret = 0.;
for ( i = 0; i < ARRAY_DIM; i++ ) {
  c[i] = exp( ret );
  ret += a[i];
}
return ret;

2.

double a[ARRAY_DIM], c[ARRAY_DIM];
   .
double ret;
ret = 0.;
for ( i = 0; i < ARRAY_DIM; i++ ) {
  if ( a[i] > 0.01 ) {
    ret = c[i];
  }
}
return ret;
  • Yes, by adding the proper #pragma omp you don't have to change your code. You can optimize further by changing your code, but (usually) just adding the #pragma already gives an improvement. – supercheval Jul 30 '18 at 15:03
  • Neither of your code examples is easy to parallelize with OpenMP. That does not mean it can't be done, but it can't be done with one pragma statement. The first case is doing a cumulative sum/prefix sum. The second case requires knowing the last iteration to modify `ret`, i.e. the result is dependent on `i`. – Z boson Jul 31 '18 at 07:29
  • Actually the second case can be done with a single pragma using the `ordered` clause. – Z boson Jul 31 '18 at 07:38
  • Actually two pragmas. – Z boson Jul 31 '18 at 08:10
  • Why did you reinsert `.` into your code? Also why do you insist on having `return` when you did not even define a function? – Z boson Aug 03 '18 at 09:22
  • The . between the declarations and the code shows that there is a "distance" between them ( e.g., they are in different routines ); the `return` shows that 'ret' is needed outside of the loop where it is set. – David Livshin Aug 05 '18 at 01:27

1 Answer


Code example 1 can be parallelized with OpenMP using a prefix sum solution.
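For reference, here is a minimal sketch of one such two-pass prefix-sum approach for example 1 (the function name `example1_scan` and the per-thread `sums` buffer are my own illustration, not part of the question): each thread first sums its static chunk of `a`, one thread scans the per-thread totals to obtain each chunk's starting offset, and each thread then re-walks the same chunk computing `c[i] = exp(...)` with its running sum. Both loops use `schedule(static)` so the chunk assignment is identical in the two passes.

#include <math.h>
#include <omp.h>
#include <stdlib.h>

#define ARRAY_DIM 1000
double a[ARRAY_DIM], c[ARRAY_DIM];

/* Illustrative two-pass prefix-sum version of example 1. */
double example1_scan(void) {
  double ret = 0.;
  int nthreads;
  double *sums;                          /* sums[t] = total of chunks owned by threads < t */
  #pragma omp parallel
  {
    #pragma omp single
    {
      nthreads = omp_get_num_threads();
      sums = malloc((nthreads + 1) * sizeof(double));
      sums[0] = 0.;
    }                                    /* implicit barrier: sums is visible to all threads */
    int tid = omp_get_thread_num();
    double s = 0.;
    /* pass 1: each thread sums its own static chunk of a[] */
    #pragma omp for schedule(static) nowait
    for (int i = 0; i < ARRAY_DIM; i++) s += a[i];
    sums[tid + 1] = s;
    #pragma omp barrier
    /* scan the per-thread totals serially (nthreads is small) */
    #pragma omp single
    {
      for (int t = 1; t <= nthreads; t++) sums[t] += sums[t - 1];
      ret = sums[nthreads];              /* same value as the sequential ret */
    }
    /* pass 2: re-walk the same chunk, starting from this thread's offset */
    double run = sums[tid];
    #pragma omp for schedule(static)
    for (int i = 0; i < ARRAY_DIM; i++) {
      c[i] = exp(run);
      run += a[i];
    }
  }
  free(sums);
  return ret;
}

This is only a sketch; compile with something like gcc -fopenmp -O2 file.c -lm.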

For code example 2 I have two solutions in the code below, called foo1 and foo2. The function foo1 is easier to implement but is probably much less efficient if a[i] > 0.01 is frequently true. For example, if it is always true, then it has to write to ret every iteration in order, which kills the parallelization. The function foo2 only has to write in order once per thread rather than once per iteration, and since the number of iterations is much greater than the number of threads, this is much more efficient.

#include <omp.h>
#include <stdio.h>
#define ARRAY_DIM 1000
double a[ARRAY_DIM], c[ARRAY_DIM];

/* foo1: the ordered construct forces the writes to ret to happen in
   iteration order, so the last iteration with a[i] > 0.01 wins, exactly
   as in the sequential loop. */
double foo1() {
  double ret = 0;
  #pragma omp parallel for ordered schedule(static)
  for (int i = 0; i < ARRAY_DIM; i++) {
    if ( a[i] > 0.01 ) {
      #pragma omp ordered
      ret = c[i];
    }
  }
  return ret;
}

/* foo2: each thread records the last matching index of its own static
   chunk in k2, then the per-thread results are merged in thread order,
   so the thread owning the highest matching index sets k last. */
double foo2() {
  int k = -1;
  #pragma omp parallel
  {
    int k2 = -1;
    #pragma omp for schedule(static)
    for (int i = 0; i < ARRAY_DIM; i++) if ( a[i] > 0.01 ) k2 = i;
    /* one iteration per thread, executed in order */
    #pragma omp for schedule(static) ordered
    for(int i=0; i<omp_get_num_threads(); i++)
      #pragma omp ordered
      if(k2 >= 0) k = k2;
  }
  return k >= 0 ? c[k] : 0.;  /* return 0 if nothing matched, as in the original loop */
}

int main(void) {
  for(int i=0; i<ARRAY_DIM; i++) a[i] = 1;
  for(int i=0; i<ARRAY_DIM; i++) c[i] = i;
  printf("%f\n", foo1());
  printf("%f\n", foo2());
}
Z boson
  • Thank you. The solutions you suggested for example 1 and foo2 require modifications to the original code. You pointed out the deficiency of foo1 yourself. – David Livshin Jul 31 '18 at 09:15
  • @DavidLivshin Your original text did not state that you required the code not to be modified. I provided an answer because I found a solution which does not require changing the code (`foo1`). That answers your question. I provided `foo2` to show how to do it better. `foo1` is good enough if `a[i] > 0.01` is usually false. I don't know of better solutions which don't require changing the code. – Z boson Jul 31 '18 at 11:58
  • @DavidLivshin I'm skeptical of your claim that "The two examples I listed can be treated without any effort by the autoparallelizer I wrote." I'm especially skeptical that your autoparallelizer would provide better performance in these cases. – Z boson Jul 31 '18 at 12:03
  • The parallelizer I wrote ( dco ) treats the listed code samples with no effort; actually it is able to parallelize much more complicated cases. See the appropriate section on www.dalsoft.com. The site will be updated soon; I am looking for interesting benchmarks to add to those already processed ( but not yet published on the site ) - hence this posting. – David Livshin Jul 31 '18 at 12:19
  • @DavidLivshin, interesting project. How does `dco` handle case one? Do you see any performance improvement in either case? – Z boson Jul 31 '18 at 12:25
  • @DavidLivshin, why don't you update your question with actual benchmarks from `dco` compared to the OpenMP solutions? That would be interesting, and it could make for an interesting question whether a better OpenMP solution exists. – Z boson Jul 31 '18 at 12:28
  • Actually, case one is published on the site - search for "not parallel on OpenMP" on www.dalsoft.com/documentation_auto_parallelization.html. As to comparing with OpenMP - my goal is to complement OpenMP by parallelizing code that OpenMP doesn't, not to compete with OpenMP. – David Livshin Jul 31 '18 at 12:40
  • As you have asked the same question in multiple places, I won't repeat my full answer here. I did have in mind the prefix sum analogy when I disparaged it. I'm surprised no one has mentioned that the 2nd case should be reversed so as to break when the first match is found, unless you have been assigned the task of making it parallel regardless of efficiency. – tim18 Jul 31 '18 at 12:51
  • Yes, that is exactly what I would like to find out - can the 2nd case be parallelized by OpenMP - and not so much what would be an optimal implementation of this algorithm ( which is, BTW, nothing more than an example ). Also, using a break in the body of the loop causes the compiler ( gcc ) to fail to process it for OpenMP. ( The "exit" from a loop is another case that the autoparallelizer I wrote can handle, but that is for another posting. ) – David Livshin Jul 31 '18 at 13:12
  • "Yes, that is exactly what I like to find out - can ( the 2nd case ) be parallelized by OpenMP". That's exactly what `foo1` does. I don't think you have worded your question correctly. It probably should be closed. – Z boson Jul 31 '18 at 13:23