Does the OpenMP standard guarantee that `#pragma omp simd` actually vectorizes the loop, i.e. should compilation fail if the compiler can't vectorize the code?

#include <cstdint>
void foo(uint32_t r[8], uint16_t* ptr)
{
    const uint32_t C = 1000;
    #pragma omp simd
    for (int j = 0; j < 8; ++j)
        if (r[j] < C)
            r[j] = *(ptr++);
}

GCC and Clang fail to vectorize this but do not complain at all (unless you use `-fopt-info-vec-optimized-missed` and the like).

Trass3r

1 Answer

No, it is not guaranteed. Relevant portions of the OpenMP 4.5 standard that I could find (emphasis mine):

(1.3) When any thread encounters a simd construct, the iterations of the loop associated with the construct may be executed concurrently using the SIMD lanes that are available to the thread.

(2.8.1) The simd construct can be applied to a loop to indicate that the loop can be transformed into a SIMD loop (that is, multiple iterations of the loop can be executed concurrently using SIMD instructions).

(Appendix C) The number of iterations that are executed concurrently at any given time is implementation defined.

(1.2.7) implementation defined: Behavior that must be documented by the implementation, and is allowed to vary among different compliant implementations. An implementation is allowed to define this behavior as unspecified.
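One practical reading of the "may" in these passages: a conforming loop has to produce the same result whether the compiler vectorizes it, runs it scalar, or ignores the pragma entirely (for example when built without `-fopenmp` or `-fopenmp-simd`). A minimal sketch, assuming a hypothetical function `scale_add` that is not part of the question and assuming the two arrays do not overlap:

```cpp
#include <cstddef>

// Hypothetical example (not from the question): y[i] += a * x[i].
// Assumption: x and y do not overlap, so no iteration depends on another.
void scale_add(float* y, const float* x, float a, std::size_t n)
{
    // The pragma only tells the compiler that SIMD execution is permitted.
    // The result is the same if the loop ends up running one element at a time.
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

Whether SIMD instructions are actually emitted for this loop still depends on the compiler, target, and flags; the standard text quoted above does not promise it.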

Zulan
  • Looks like it. But then it's barely better than just hoping for auto-vectorization. – Trass3r Jan 11 '18 at 20:10
  • It may give the compiler some information about possible transformation that it cannot automatically deduce due to language issues (e.g. aliasing). TBH, personally, I'm not a huge fan of this kind of hint. – Zulan Jan 11 '18 at 20:15
  • I know, it's just a bit nicer and more portable than '__restrict' and all those compiler-specific loop and vector pragmas. – Trass3r Jan 11 '18 at 20:47
  • But it's not useful if you don't even get a compiler warning once it fails (esp. later on as a regression due to change of code or compiler version). And it may easily and silently lead to wrong code as you basically force the compiler to ignore its own analysis: https://godbolt.org/g/dp5JVR (It dropped the ptr++ operation!) – Trass3r Jan 11 '18 at 20:53
  • It's basically for cases where the programmer is more clever than the compiler. I'd argue that is not often the case - including myself. – Zulan Jan 11 '18 at 22:20
  • @Trass3r: Are you sure that's actually legal with `#pragma omp simd`, and not a compiler bug? OTOH, I see ICC17 does that too (but it doesn't understand `-march=haswell`, only `-march=native` or `-xHOST`, so it uses AVX512.) GCC doesn't vectorize it at all with `-fopenmp`. https://godbolt.org/g/H7u3Cw. But it doesn't vectorize even if we remove the `++` so ICC's output would be valid even without the `#pragma` https://godbolt.org/g/rV2i5m. Note that you don't usually need `simdlen`. It might matter with `gcc -march=bdver2` or something (which sets `-mprefer-avx128`) – Peter Cordes Jan 11 '18 at 23:04
  • Intel compilers default to emitting a diagnostic when omp simd is enabled but doesn't result in vectorization. GNU and Clang compilers have many more cases where omp simd is ignored, so such diagnostics might be annoying. The case presented doesn't look like one where you would want attempted vectorization, even if a way might be thought up to deal with the loop-carried dependency. – tim18 Jan 11 '18 at 23:24
  • @Peter In the compress section of https://software.intel.com/en-us/articles/explicit-vector-programming-best-known-methods they say it's incorrect to use the pragma for this but still happily compile it. – Trass3r Jan 11 '18 at 23:37
  • @tim18: Auto-vectorization could in theory use the same technique [that's possible for manual vectorization](https://stackoverflow.com/questions/48174640/avx2-expand-contiguous-elements-to-a-sparse-vector-based-on-a-condition-like-a), especially with AVX512 where there's an instruction for this: [`vpexpandd ymm1{k1}, ymm2`](https://github.com/HJLebbink/asm-dude/wiki/VPEXPANDD). In practice, auto-vectorization doesn't currently handle it (e.g. with no pragma, just plain old autovec, not even with ICC18 targeting AVX512). – Peter Cordes Jan 11 '18 at 23:54
  • By the way, I just found out about clang's `-Wpass-failed` which notifies you about failed vectorization. (Unless the loop unroller kicks in: https://bugs.llvm.org/show_bug.cgi?id=35925) I wonder if other compilers have something similar. – Trass3r Jan 13 '18 at 11:04
  • Yes, Intel Fortran does compile the pack intrinsic to a pack instruction when an appropriate target is specified. It might be more amenable in C if ptr were given local scope as j is so that there is clearly no dependency on final value. omp simd doesn't support firstprivate so that wouldn't be a way to fix that deficiency in the example. – tim18 Jan 15 '18 at 16:34
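
The loop-carried dependency on `ptr` discussed in the comments above can also be removed by hand, which makes the `simd` hint valid for the remaining loop. This is only a sketch of one possible rewrite, not something proposed in the thread: the function name `foo_two_pass` and the helper array `src` are assumptions, and the second loop is a gather, so a compiler may still decline to vectorize it.

```cpp
#include <cstdint>

// Hypothetical two-pass rewrite of foo() from the question.
void foo_two_pass(uint32_t r[8], const uint16_t* ptr)
{
    const uint32_t C = 1000;
    int src[8];  // src[j]: offset into ptr that iteration j would read

    // Pass 1 (scalar): exclusive running count of the elements selected so far.
    // This is where the original ptr++ dependency now lives.
    int count = 0;
    for (int j = 0; j < 8; ++j) {
        src[j] = count;
        count += (r[j] < C) ? 1 : 0;
    }

    // Pass 2: each iteration touches only r[j] and reads a precomputed offset,
    // so the iterations are independent and may run concurrently.
    #pragma omp simd
    for (int j = 0; j < 8; ++j)
        if (r[j] < C)
            r[j] = ptr[src[j]];
}
```

The result should match the original scalar loop because the condition `r[j] < C` is evaluated on the unmodified value of `r[j]` in both passes; each iteration only ever writes its own element.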