
I'm trying to parallelize the inner loop of a program that has a data dependency (min) outside the scope of the loops. I'm having an issue where the residual calculations occur outside the scope of the inner j loop. The code produces errors if the "#pragma omp parallel" part is included on the j loop, even if that loop doesn't run at all because the k value is too low, say 1, 2 or 3 for example.

for (i = 0; i < 10; i++)
  {
    #pragma omp parallel for shared(min) private (j, a, b, storer, arr) //
    for (j = 0; j < k-4; j += 4)
    {
      mm_a = _mm_load_ps(&x[j]);
      mm_b = _mm_load_ps(&y[j]);
      mm_a = _mm_add_ps(mm_a, mm_b);
      _mm_store_ps(storer, mm_a);

      #pragma omp critical
      {
      if (storer[0] < min)
      {
        min = storer[0];
      }
      if (storer[1] < min)
      {
        min = storer[1];
      }
      //etc
      }
    }
    do
    {
        #pragma omp critical
        {
        if (x[j]+y[j] < min)
        {
          min = x[j]+y[j];
        }
        }
    } while (j++ < (k - 1));
    round_min = min;
  }
  • Can you be a bit more specific about what error you are getting? Are these compile-time errors or are you getting wrong results? My first thought would be that you have a race condition on the `min` variable, so the obvious solution would be a `reduction(min:min)` clause that you'll have to add to the `parallel for` directive. – Michael Klemm Mar 30 '21 at 11:41
  • Instead of all the branching you should just use `_mm_min_ps` to get a vector of 4 minimal values and reduce this to one element at the end (actually, due to latency, you need multiple registers). There is probably a duplicate for this. – chtz Mar 30 '21 at 12:42
  • Nevertheless, you need to post an actual [mre] (without `//etc` comments, and without using undeclared variables). What is the outer-loop actually for? Do `x` and `y` change? – chtz Mar 30 '21 at 12:45

1 Answer


The j-based loop is a parallel loop, so you cannot use j after it. This is especially true since you explicitly put j in the private clause, making it visible only locally within each thread and not after the parallel region. You can explicitly recompute the position where the loop stopped, (k-4+3)/4*4, just after the parallel loop.
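
For example, here is a minimal sketch of the sequential residual loop (assuming k, x, y and min are declared as in the question; j_tail is a name introduced for illustration), which no longer needs a critical section:

int j_tail = (k - 4 + 3) / 4 * 4;   // first index not covered by the SIMD loop
for (; j_tail < k; ++j_tail)        // scalar tail, executed sequentially
{
    const float tmp = x[j_tail] + y[j_tail];
    if (tmp < min)
        min = tmp;
}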

Furthermore, here are a few important points:

  • You may not really need to vectorize the code yourself: you can use an omp simd reduction. OpenMP can do all the boring work of computing the residual calculations for you automatically. Moreover, the code will be portable and much simpler. The generated code will also likely be faster than yours. Note however that some compilers might not be able to vectorize the code (GCC and ICC do, while Clang and MSVC often need some help).
  • Critical sections (omp critical) are very costly. In your case they would simply annihilate any improvement from the parallel section; the code would likely be slower due to cache-line bouncing.
  • Reading back data written by _mm_store_ps is inefficient here, although some compilers (like GCC) may be able to understand the logic of your code and generate a faster implementation (extracting lane data).
  • Horizontal SIMD reductions are inefficient. Use vertical ones, which are much faster and can easily be applied here (see the sketch after this list).
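
As an illustration of the last points, here is a minimal sketch of a vertical SIMD min reduction based on _mm_min_ps (not part of the original answer; min_of_sums is a hypothetical helper). It uses two accumulators so the two independent minps dependency chains can overlap and hide the instruction latency, and it assumes n is a multiple of 8 and that x and y are 16-byte aligned:

#include <xmmintrin.h>
#include <float.h>

static float min_of_sums(const float *x, const float *y, int n)
{
    __m128 vmin0 = _mm_set1_ps(FLT_MAX);   // accumulator 0
    __m128 vmin1 = _mm_set1_ps(FLT_MAX);   // accumulator 1

    for (int j = 0; j < n; j += 8)
    {
        __m128 s0 = _mm_add_ps(_mm_load_ps(&x[j]),     _mm_load_ps(&y[j]));
        __m128 s1 = _mm_add_ps(_mm_load_ps(&x[j + 4]), _mm_load_ps(&y[j + 4]));
        vmin0 = _mm_min_ps(vmin0, s0);     // vertical min: 4 lanes at once
        vmin1 = _mm_min_ps(vmin1, s1);
    }

    // Single horizontal reduction, done once outside the hot loop
    __m128 v = _mm_min_ps(vmin0, vmin1);
    v = _mm_min_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 0, 3, 2)));
    v = _mm_min_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtss_f32(v);
}

Note that this stays entirely in registers: there is no _mm_store_ps round-trip through memory and no critical section in the hot loop.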

Here is a corrected version of the code taking the above points into account:

for (i = 0; i < 10; i++)
{
    // Assume min is already initialized correctly here

    #pragma omp parallel for simd reduction(min:min) private(j)
    for (j = 0; j < k; ++j)
    {
        const float tmp = x[j] + y[j];
        if(tmp < min)
            min = tmp;
    }

    // Use min here
}

The above code is vectorized correctly for the x86 architecture by GCC/ICC (both with -O3 -fopenmp), Clang (with -O3 -fopenmp -ffast-math) and MSVC (with /O2 /fp:precise -openmp:experimental).

Jérôme Richard
  • You might be able to hand-hold compilers into better asm with `min = min – Peter Cordes Apr 01 '21 at 19:04
  • Huh, that ternary actually defeats clang's auto-vectorization even with `-ffast-math` without OpenMP https://godbolt.org/z/d5a6Ma9va. (Which *is* necessary; neither min() implementation is associative for NaNs or signed-zero.) I simplified the outer loop so it's just reducing 1024 float elements to scalar once, so we can look at that asm. – Peter Cordes Apr 01 '21 at 19:19
  • With OpenMP, clang just parallelizes, doesn't vectorize. (just `minss`, no `minps`) https://godbolt.org/z/s1YzqPq46. GCC does vectorize even with just `-O3 -fopenmp`, without `-ffast-math`, but that doesn't get GCC to unroll with multiple accumulators to hide MINPS latency. :/ That's normal for GCC's standard auto-vectorization (without profile-guided optimization `-fprofile-use`), but one might hope its OpenMP vectorizer would be more aggressive. – Peter Cordes Apr 01 '21 at 19:21
  • Yes, indeed, Clang behaves strangely (and MSVC too). This is why I edited the answer to replace the ternary with a conditional. I don't know why Clang's auto-vectorizer fails to recognize a minimum-based reduction. It seems like a bug or an internal limitation. I would be very interested to know more about why it behaves this way. – Jérôme Richard Apr 01 '21 at 19:26
  • I'd guess maybe limited / less mature support for OpenMP? Parallelizing the outer loop and letting normal auto-vectorization work on the inner loop might be good, depending on what that does to locality. Although I guess for cache-blocking you really want each core to be touching the same subset of the data repeatedly, not have each core loop over all the data. – Peter Cordes Apr 01 '21 at 19:28