
Typically, I might write a SIMD loop like:

float * x = (float *) malloc(10 * sizeof(float));
float * y = (float *) malloc(10 * sizeof(float));

for(int i = 0; i < 10; i++)
    y[i] = 10;

#pragma omp simd
for(int i = 0; i < 10; i++)
    x[i] = y[i]*y[i];

And suppose I have two tasks in mind:

float square(float x) {
    return x * x;
}
float halve(float x) {
    return x / 2.;
}

And an omp loop primitive:

void apply_simd(float * x, float * y, int length, float (*simd_func)(float c)){
    #pragma omp simd
    for(int i = 0; i < length; i++)
         x[i] = simd_func(y[i]);
}

Is this legal within the parameters of SIMD? Or is the compiler going to produce less efficient code than if I explicitly inline everything?

Does writing:

float inline square(float x){ ... } 

Change anything? Or can I expect to benefit from SIMD only if I explicitly write down the operation in terms of native functions/operators only?

Chris

1 Answer


Yes, with optimization enabled (-O3 -march=native), modern compilers can reliably inline through function pointers if these conditions are met:

  • the function-pointer has a compile-time constant value
  • it's pointing to a function whose definition the compiler can see

That sounds simple to ensure, but if this code is used in a shared lib on Unix/Linux (compiled with -fPIC) then symbol-interposition rules mean that float halve(float x) { return x * 0.5f; } (see footnote 1) can't inline even within the same translation unit. See Sorry state of dynamic libraries on Linux.

Use the inline keyword to allow inlining even when building a shared library; like static it allows the compiler not to emit a stand-alone definition of the function at all, if it decides to inline at every call-site.

Use inline on halve, square, and apply_simd. (Because apply_simd needs to inline into a caller that passes halve as a function arg. A stand-alone definition of apply_simd is useless because it can't inline the unknown function.) If they're in a .cpp instead of a .h, you might as well make them static as well or instead, otherwise just make them inline.
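Putting the advice above together, a header following this pattern might look like the sketch below (the file name and the static-inline combination are my choices, assuming everything lives in one header included by the callers):

```cpp
// simd_ops.h (hypothetical): definitions visible at every call site
#include <cstddef>

// static inline: internal linkage, so -fPIC symbol interposition can't
// apply, and no stand-alone definition needs to be emitted at all
static inline float square(float x) { return x * x; }
static inline float halve(float x)  { return x * 0.5f; }

static inline void apply_simd(float *x, const float *y, std::size_t length,
                              float (*simd_func)(float)) {
    #pragma omp simd
    for (std::size_t i = 0; i < length; i++)
        x[i] = simd_func(y[i]);
}
```

With everything defined in the same header, the function-pointer argument is a compile-time constant at each call site, so both conditions from the bullet list above are satisfied.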


Do as much work as possible in one pass

I suspect you wanted to write something really inefficient like this:

apply_simd(x, y, length, halve);   // copy y to x
apply_simd(x, x, length, square);  // then update x in-place
// NEVER DO THIS, make one function that does both things
// with gcc and clang, compiles as written to two separate loops.

A loop that only does a copy-and-multiply by 0.5f will usually bottleneck on memory bandwidth. Modern CPUs like Haswell (or Skylake) have twice as much FMA/mul (or add) throughput (2x 256-bit vectors per clock) as they have store bandwidth (1x 256-bit vector per clock to L1d). Computational intensity matters: don't gimp your code by writing multiple loops that each do a separate trivial operation.

With any loop unrolling, or if data doesn't fit in L1d, the throughput for SIMD x[i] = 0.25f * y[i]*y[i] will be the same as for either of those operations alone.

I checked the asm output from g++ 8.2 and clang++ 6.0 on the Godbolt compiler explorer. Even with __restrict to tell it that x and y don't overlap, the compilers still made 2 separate loops.
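The single fused loop that the comment above asks for could look like this sketch (halve_then_square is a name I made up, not from the question):

```cpp
#include <cstddef>

// x[i] = square(halve(y[i])), done in one pass over the data
static inline void halve_then_square(float *x, const float *y,
                                     std::size_t length) {
    #pragma omp simd
    for (std::size_t i = 0; i < length; i++) {
        float h = 0.5f * y[i];  // halve first: one multiply by a constant
        x[i] = h * h;           // then square
    }
}
```

This reads y once and writes x once, instead of reading/writing the arrays twice across two loops.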


Pass a lambda as a function pointer

We can easily compose arbitrary operations into a single function using a lambda, and pass it as a function pointer. This solves the above problem of creating two separate loops, while still giving you the desired syntax of wrapping the loop up in a function.

If your halve(float) function was a placeholder for something non-trivial, you can use it in a lambda to compose it with something else. e.g. square(halve(a))

In earlier C++ standards, you need to assign the lambda to a function pointer. (Lambda as function parameter)

// your original function mostly unchanged, but with size_t and inline
inline  // allows inlining even with -fPIC
void apply_simd(float * x, const float *y, size_t length, float (*simd_func)(float c)){
    #pragma omp simd
    for(size_t i = 0; i < length; i++)
         x[i] = simd_func(y[i]);
}

C++11 caller:

// __restrict isn't needed with OpenMP, but you might want to assert non-overlapping for better auto-vectorization with non-OpenMP compilers.
void test_lambda(float *__restrict x, const float *__restrict y, size_t length)
{
    float (*funcptr)(float) = [](float a) -> float {
         float h=0.5f*a; // halve first allows vmulps with a memory source operand
         return h*h;    // 0.25 * a * a doesn't optimize to that with clang :/
    };

    apply_simd(x, y, length, funcptr);
}

In C++17 it's even easier, and Just Works with a literal anonymous lambda:

void test_lambda17(float *__restrict x, const float *__restrict y, size_t length)
{
    apply_simd(x, y, length, [](float a) {
        float h = 0.5f*a;
        return h * h;
      }
    );
}

These both compile efficiently with gcc and clang, to an inner loop like this (Godbolt compiler explorer).

.L4:
    vmulps  ymm0, ymm1, YMMWORD PTR [rsi+rax]
    vmulps  ymm0, ymm0, ymm0
    vmovups YMMWORD PTR [rdi+rax], ymm0
    add     rax, 32
    cmp     rax, rcx
    jne     .L4

clang unrolls some, and probably comes close to one 256-bit vector loaded+stored per clock, with 2 multiplies. (A non-indexed addressing mode would make that achievable with unrolling to hide the two pointer increments. Silly compiler. :/)


Lambda or function pointer as a template parameter

With a local lambda as a template param (defined inside the function), the compiler can definitely always inline. But (due to a gcc bug) this isn't currently usable.

But with only a function pointer, it doesn't actually help catch cases where you forgot the inline keyword or otherwise broke the compiler's ability to inline. It only requires that the function address be a link-time constant, so it doesn't save you from symbol interposition. When compiling with -fPIC, the compiler still doesn't know whether the definition of a global function it can see will be the one actually resolved at dynamic-link time, or whether LD_PRELOAD or a symbol in the main executable will override it. So it just emits code that loads the function's address from the GOT and calls it in a loop. SIMD is of course impossible.

It does stop you from shooting yourself in the foot by passing function pointers around in ways that can't always inline, though. With constexpr you can probably still pass them around as args before using them in a template. So you might want to use this approach, if it weren't for a gcc bug that stops you from using it with lambdas.

C++17 allows passing an automatic-storage lambda with no captures as a function object. (Previous standards required external or internal (static) linkage for functions passed as template params.)

template <float simd_func(float c)>
void apply_template(float *x, const float *y, size_t length){
    #pragma omp simd
    for(size_t i = 0; i < length; i++)
         x[i] = simd_func(y[i]);
}


void test_lambda(float *__restrict x, const float *__restrict y, size_t length)
{
    // static // even static doesn't help work around the gcc bug
    constexpr auto my_op = [](float a) -> float {
         float h=0.5f*a; // halve first allows vmulps with a memory source operand
         return h*h;    // 0.25 * a * a doesn't optimize to that with clang :/
    };

    // unary + converts the captureless lambda to a plain function pointer
    apply_template<+my_op>(x, y, length); // clang accepts this, gcc doesn't
}

clang compiles it just fine, but g++ incorrectly rejects it even with -std=gnu++17.

Unfortunately, gcc has a bug (83258) with using lambdas this way. See Can I use the result of a C++17 captureless lambda constexpr conversion operator as a function pointer template non-type argument? for details.

Still, we can use regular functions with the template.

// `inline` is still necessary for it to actually inline with -fPIC (in a shared lib)
inline float my_func(float a) { return 0.25f * a*a;}

void test_template(float *__restrict x, const float *__restrict y, size_t length)
{
    apply_template<my_func>(x, y, length);   // not actually a lambda, just a function
}

Then we get an inner loop like this from g++8.2 -O3 -fopenmp -march=haswell. Notice that I used 0.25f * a * a; instead of doing halve first, to see what kind of bad code we get. This is what g++8.2 does with that.

.L25:
    vmulps  ymm0, ymm1, YMMWORD PTR [rsi+rax]   # ymm0 = 0.25f * y[i+0..7]
    vmulps  ymm0, ymm0, YMMWORD PTR [rsi+rax]   # reload the same vector again
    vmovups YMMWORD PTR [rdi+rax], ymm0        # store to x[i+0..7]
    add     rax, 32
    cmp     rax, rcx
    jne     .L25

Reloading the same vector twice to save instructions would possibly be a good idea if gcc weren't using an indexed addressing mode, which stops it from micro-fusing on Haswell/Skylake. So this loop actually issues as 7 uops, running at best 7/4 cycles per iteration.

With unrolling, approaching the 2 read + 1 write per clock limit for wide vectors is apparently a problem for sustained operation, according to Intel's optimization manual. (They say Skylake might sustain 82 bytes per clock, vs. the peak of 96 loaded + stored in one clock.) It's especially unwise if the data isn't known to be aligned, and gcc8 has switched to an optimistic strategy for unknown-alignment data: use unaligned loads/stores and let the hardware handle cases when there isn't 32-byte alignment. gcc7 and earlier aligns the pointer before the main loop, and loads the vector only once.


Footnote 1: Fortunately gcc and clang can optimize x / 2. into x * 0.5f, avoiding promotion to double.

Using multiply instead of divide is possible without -ffast-math because 0.5f is exactly representable as a float, unlike fractions where the denominator isn't a power of 2.

But note that 0.5 * x does not optimize to 0.5f * x; gcc and clang do actually expand to double and back. I'm not sure if that's a missed optimization vs. x / 2., or if there's a real semantic difference that stops it from optimizing a double constant to a float when it can be represented exactly as a float.
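The type promotion can be checked at compile time with decltype (an illustrative sketch; div_by_two and mul_half are hypothetical names):

```cpp
#include <type_traits>

// The constant's type controls the type of the whole expression:
static_assert(std::is_same<decltype(0.5f * 1.0f), float>::value,
              "float constant: the multiply stays in float");
static_assert(std::is_same<decltype(0.5 * 1.0f), double>::value,
              "double constant: the whole expression promotes to double");

inline float div_by_two(float x) { return x / 2.; }   // optimizable to x * 0.5f
inline float mul_half(float x)   { return 0.5f * x; } // already a single float mul
```

Both functions return the same values; the question is only whether the compiler has to round-trip through double to get there.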

Peter Cordes
  • this is one of the most comprehensive deployments of knowledge on any topic that I have ever run across. Amazing. Thank you. With respect to all of this, the goal was syntax and code markup for reuse--but your response was profound. I will check these this godbolt out. – Chris Aug 08 '18 at 22:21