
I'm using OpenMP to parallelize a loop that internally uses AVX-512 via Agner Fog's VCL (Vector Class Library).

Here is the code:

double HarmonicSeries(const unsigned long long int N) {
  unsigned long long int i;
  Vec8d divV(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0);
  Vec8d sumV(0.0);
  const Vec8d addV(8.0);
  const Vec8d oneV(1.0);

  #pragma omp parallel for reduction(+:sumV,divV)
  for(i=0; i<N; ++i) {
    sumV += oneV / divV;
    divV += addV;
  }
  return horizontal_add(sumV);
}

When trying to compile the code above, I'm getting

g++ -Wall -Wextra -O3 -g -I include -fopenmp -m64 -mavx2 -mfma -std=c++17 -o harmonic_series harmonic_series.cpp
harmonic_series.cpp:87:40: error: user defined reduction not found for ‘sumV’
   87 |   #pragma omp parallel for reduction(+:sumV,divV)
      |                                        ^~~~
harmonic_series.cpp:87:45: error: user defined reduction not found for ‘divV’
   87 |   #pragma omp parallel for reduction(+:sumV,divV)

Any hints on how to solve this and provide the user-defined reduction for the Vec8d class? It's simply the plus operator, which is already defined by the VCL class, but I cannot find any example of how to code this.

Thanks a lot for any help!

Jirka
  • This answer: https://stackoverflow.com/questions/70678525/declare-user-defined-reduction-of-openmp-for-class-template helped me move forward. By adding #pragma omp declare reduction( + : Vec8d : omp_out = omp_out + omp_in ) initializer (omp_priv=omp_orig) I can make the compiler happy; however, something is not right, I'm getting completely wrong sumV results. – Jirka Nov 22 '22 at 03:05

2 Answers


I have found two solutions:

  1. Using #pragma omp declare reduction
double HarmonicSeries(const unsigned long long int N) {
  unsigned long long int i;
  Vec8d divV(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0);
  Vec8d sumV(0.0);
  const Vec8d addV(8.0);
  const Vec8d oneV(1.0);
  const Vec8d startdivV = divV;

  #pragma omp declare reduction( + : Vec8d : omp_out = omp_out + omp_in ) initializer (omp_priv=omp_orig)
  #pragma omp parallel for private(divV) reduction(+:sumV)
  for(i=0; i<N; ++i) {
    divV = startdivV + i * addV;
    sumV += oneV / divV;
  }
  return horizontal_add(sumV);
}
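As a side note, a + reduction should start from zero, so spelling out an explicit zero initializer is arguably clearer than copying the original value (it works above only because sumV happens to start at 0.0). A minimal variant, assuming VCL's broadcast constructor Vec8d(0.0):

  // Start every thread's private copy from an explicit zero vector.
  #pragma omp declare reduction( + : Vec8d : omp_out += omp_in ) \
      initializer( omp_priv = Vec8d(0.0) )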

Drawback: there is some performance penalty, since the divisor has to be recomputed in every iteration as

divV = startdivV + i * addV;

instead of being incremented by addV at each step:

divV += addV;

(multiplication is slower than addition).

Question: can I use omp parallel for with private, but set the divisor at the start for each thread and then only increment it?

  divV = startdivV + start_value_of_i * addV;
  for(i=start_value_of_i; i<end_value_of_i; ++i) {
    sumV += oneV / divV;
    divV += addV;
  }
  2. Use omp parallel and omp critical to implement the reduction manually. It gives me more control, but the code is longer and more error-prone. The implementation below also assumes that the number of iterations is divisible by the number of threads.
double HarmonicSeries(const unsigned long long int N) {
  unsigned long long int i;
  Vec8d divV(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0);
  Vec8d sumV(0.0);
  const Vec8d addV(8.0);
  const Vec8d oneV(1.0);

/*
Manual reduction.
This code blindly assumes that N is divisible by the number of threads.
Requires <omp.h> for omp_get_thread_num() and omp_get_num_threads().
https://stackoverflow.com/questions/18746282/openmp-schedulestatic-with-no-chunk-size-specified-chunk-size-and-order-of-as
*/
  unsigned long long int start, end;
  int thread_num, num_threads;
  Vec8d localsumV;
  Vec8d localdivV;

  #pragma omp parallel private( i, thread_num, num_threads, start, end, localsumV, localdivV )
  {
    thread_num = omp_get_thread_num();
    num_threads = omp_get_num_threads();
    start = thread_num * N / num_threads;
    end = (thread_num + 1) * N / num_threads;
    localsumV = sumV;
    localdivV = start* addV + divV;
    for(i=start; i<end; ++i) {
      localsumV += oneV / localdivV;
      localdivV += addV;
    }
    #pragma omp critical
    {
      sumV += localsumV;
    }
  } // end of parallel region

  return horizontal_add(sumV);
}

I like the first version more.

Any hints on how to improve the first version so that the divisor is incremented in each iteration, instead of being computed as divV = startdivV + i * addV;?

Jirka
    It's actually not that multiplication itself is slower than addition; same throughput and latency as addition on Zen4 (3c latency, 1c throughput for ZMM). vmulpd and vaddpd don't even compete for the same execution ports. (And the multiply+add can hopefully contract into one FMA.) It's that u64->fp conversion has a cost (but at least with AVX-512 it's still a single instruction). Plus a SIMD vector-int add, or a scalar broadcast of the loop counter. Perhaps with `-ffast-math`, a compiler might strength-reduce `i*addV` to asm equivalent to the `divV +=` FP addition you had been using. – Peter Cordes Nov 22 '22 at 05:58
  • Do you need `reduction(+ : divV)` at all? Each thread needs its own version of it at some starting point, but it's just an induction variable, not a *reduction*; your code after the loop doesn't need its value. Combining the `declare reduction` stuff for Vec8d in general with just `#pragma omp parallel for reduction(+ : sumV)`, GCC compiles it. I don't know if the resulting asm is actually *correct*, though; I didn't write a test caller. https://godbolt.org/z/xY15db1Wq . The asm has the expected insns, but with some store/reload so it might be terrible. And IDK about the init values. – Peter Cordes Nov 22 '22 at 06:05
  • Ok, I read some more about OpenMP, and apparently you do need to eliminate induction variables like `divV` because that would produce a race. (https://courses.engr.illinois.edu/cs420/fa2012/foils/OpenMPPart2.pdf) I guess the design intent is that the compiler can strength-reduce back to an induction variable if that's best? – Peter Cordes Nov 22 '22 at 06:11
  • Hi Peter, thanks a lot for your comments. You are right, `reduction(+ : divV)` is not needed at all. I have removed it. For speed - you are right, that slowdown is not due to the addition vs FMA, but due to the u64->fp conversion. I have measured runtime with 12 threads and I got {real 9s, user 1m46s} for manual reduction, and {real 9.6s; user 1m52s} for automatic reduction with `divV = startdivV + i * addV;` I was able to achieve the same speed for automatic reduction by doing `divV = startdivV + i * addV;` only in first round and then using `divV += addV;` I will post final solution below. – Jirka Nov 22 '22 at 23:14
  • And it still gives correct results with `divV += addV;`? I'm not that familiar with OpenMP, but some stuff google found was saying that inter-iteration dependencies would create a "race". (Which seems weird to me; local vars aren't usually going to get stored/reloaded, but every thread needs to find the right starting value for it so maybe "race" is just a name for that.) – Peter Cordes Nov 22 '22 at 23:23
  • Please check the answer/code which I have just posted. Each thread will compute `divV = startdivV + i * addV` in the first iteration of the for loop (tracked by private variable `bool first_loop` for each thread) and only in subsequent iterations use `divV += addV`. This works since `divV` is a private variable. In essence, I explicitly set the starting value of `divV` for each thread. – Jirka Nov 23 '22 at 00:52
  • Ah ok, that makes sense. Declaring it as `private` and getting it initialized is necessary; my godbolt link with naive code presumably was unsafe. Like in your new answer, or the code at the end of this answer. – Peter Cordes Nov 23 '22 at 03:18
  • I wonder if it would be slightly better to horizontal_sum locally before the `omp_critical` in the version in this answer, so the hsum shuffles could be running in parallel with waiting for the lock or whatever. And if we're lucky, using a `lock cmpxchg` retry loop for a lock-free sum of a scalar `double`. Maybe irrelevant for any array size worth parallelizing, unless a lock-free reduction actually helps (or hurts; it might be worse than normal locking if they all tend to finish at once) – Peter Cordes Nov 23 '22 at 03:21
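To illustrate the last comment above, here is a minimal, untested sketch of the manual-reduction variant in which each thread first collapses its local vector with horizontal_add, so only a plain double is accumulated inside the critical section. Variable names reuse those from the answer above; like the original, it needs vectorclass.h and omp.h.

double HarmonicSeries(const unsigned long long int N) {
  const Vec8d addV(8.0);
  const Vec8d oneV(1.0);
  const Vec8d startdivV(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0);
  double sum = 0.0;   // scalar shared accumulator

  #pragma omp parallel
  {
    const int thread_num = omp_get_thread_num();
    const int num_threads = omp_get_num_threads();
    // Same blunt chunking as above: assumes N is divisible by num_threads.
    const unsigned long long int start = thread_num * N / num_threads;
    const unsigned long long int end = (thread_num + 1ULL) * N / num_threads;

    Vec8d localsumV(0.0);
    Vec8d localdivV = startdivV + start * addV;
    for (unsigned long long int i = start; i < end; ++i) {
      localsumV += oneV / localdivV;
      localdivV += addV;
    }

    // Horizontal sum outside the lock, so the shuffles can overlap with waiting.
    const double localsum = horizontal_add(localsumV);
    #pragma omp critical
    sum += localsum;
  }
  return sum;
}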

Here is the final solution. It uses automatic reduction and avoids the u64->fp conversion by computing divV = startdivV + i * addV only in the first iteration each thread executes, and then using divV += addV for all subsequent iterations. Runtime to compute the sum of the first 9.6e10 elements is {real 9s, user 1m46s} with 12 threads on an Intel Core i7-10850H CPU, the same as for the manual reduction.

double HarmonicSeries(const unsigned long long int N) {
  unsigned long long int i;
  Vec8d divV(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0);
  Vec8d sumV(0.0);
  const Vec8d addV(8.0);
  const Vec8d oneV(1.0);
  const Vec8d startdivV = divV;
  bool first_loop = true;
  #pragma omp declare reduction( + : Vec8d : omp_out = omp_out + omp_in ) initializer (omp_priv=omp_orig)
  // It's important to mark the "first_loop" variable as firstprivate so that each thread's private copy gets initialized.
  // lastprivate(divV) gives each thread its own private copy of divV (left shared by default, it would cause a race).
  #pragma omp parallel for firstprivate(first_loop) lastprivate(divV) reduction(+:sumV)
  for(i=0; i<N; ++i) {
    if (first_loop) {
      divV = startdivV + i * addV;
      first_loop = false;
    } else {
      divV += addV;
    }
    sumV += oneV / divV;
  }
  return horizontal_add(sumV);
}
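For completeness, a small hypothetical test caller (none was written in the discussion above) that checks the result against the asymptotic approximation H_n ≈ ln(n) + γ. The file name and the value of N are made up for illustration; HarmonicSeries is the function from this answer.

// test_harmonic.cpp (hypothetical) -- sanity check for HarmonicSeries()
// Build roughly like the question's command line, e.g.
//   g++ -O3 -I include -fopenmp -mavx2 -mfma -std=c++17 harmonic_series.cpp test_harmonic.cpp -o test_harmonic
#include <cmath>
#include <cstdio>

double HarmonicSeries(const unsigned long long int N);  // defined as above

int main() {
  const unsigned long long int N = 1000000000ULL;       // 1e9 Vec8d iterations = 8e9 terms
  const double sum = HarmonicSeries(N);
  // H_n ~ ln(n) + gamma for large n; here n = 8*N terms are summed.
  const double gamma  = 0.57721566490153286;            // Euler-Mascheroni constant
  const double approx = std::log(8.0 * static_cast<double>(N)) + gamma;
  std::printf("HarmonicSeries      = %.12f\n"
              "ln(8N) + gamma      = %.12f\n"
              "difference          = %.3e\n",
              sum, approx, sum - approx);
  return 0;
}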
Jirka