Yes, with optimization enabled (`-O3 -march=native`), modern compilers can reliably inline through function pointers if these conditions are met:
- the function pointer has a compile-time-constant value
- it points to a function whose definition is visible to the compiler
That sounds simple to ensure, but if this code is used in a shared library on Unix/Linux (compiled with `-fPIC`) then symbol-interposition rules mean that `float halve(float x) { return x * 0.5f; }`¹ can't inline even within the same translation unit. See *Sorry state of dynamic libraries on Linux*.
Use the `inline` keyword to allow inlining even when building a shared library; like `static`, it allows the compiler not to emit a stand-alone definition of the function at all, if it decides to inline at every call-site.

Use `inline` on `halve`, `square`, and `apply_simd`. (`apply_simd` needs to inline into a caller that passes `halve` as a function arg; a stand-alone definition of `apply_simd` is useless because it can't inline the unknown function.) If they're in a `.cpp` instead of a `.h`, you might as well make them `static` as well or instead; otherwise just make them `inline`.
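As a concrete sketch of that arrangement (`halve` and `square` bodies are assumed from the question; the OpenMP pragma on `apply_simd` is omitted here for brevity), a header might look like:

```cpp
#include <cstddef>

// All three functions marked inline, so a -fPIC shared-library build
// can still inline them: no stand-alone definition has to be emitted
// if every call-site inlines.
inline float halve(float x)  { return x * 0.5f; }
inline float square(float x) { return x * x; }

inline void apply_simd(float *x, const float *y, std::size_t length,
                       float (*simd_func)(float)) {
    for (std::size_t i = 0; i < length; i++)
        x[i] = simd_func(y[i]);
}
```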
### Do as much work as possible in one pass
I suspect you wanted to write something really inefficient like this:
```cpp
apply_simd(x, y, length, halve);  // copy y to x
apply_simd(x, x, length, square); // then update x in-place
// NEVER DO THIS: make one function that does both things.
// With gcc and clang, this compiles as written to two separate loops.
```
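A fused single-pass version (a sketch; `halve_then_square` is a name I made up) reads `y` once and writes `x` once:

```cpp
#include <cstddef>

// One pass over the data: x[i] = square(halve(y[i])) = 0.25f * y[i] * y[i].
// Roughly halves the memory traffic compared to two separate loops.
inline void halve_then_square(float *x, const float *y, std::size_t length) {
    for (std::size_t i = 0; i < length; i++) {
        float h = 0.5f * y[i];  // halve first: one mul with a memory source operand
        x[i] = h * h;
    }
}
```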
A loop that only does a copy-and-multiply by `0.5f` will usually bottleneck on memory bandwidth. Modern CPUs like Haswell (or Skylake) have twice as much FMA/mul (or add) throughput (2x 256-bit vectors per clock) as they have store bandwidth (1x 256-bit vector per clock to L1d). Computational intensity matters; don't gimp your code by writing multiple loops that do separate trivial operations.

With any loop unrolling, or if the data doesn't fit in L1d, the throughput for SIMD `x[i] = 0.25f * y[i]*y[i]` will be the same as for either of those operations alone.
I checked the asm output from g++ 8.2 and clang++ 6.0 on the Godbolt compiler explorer. Even with `__restrict` to tell them that x and y don't overlap, the compilers still made 2 separate loops.
### Pass a lambda as a function pointer
We can easily compose arbitrary operations into a single function using a lambda, and pass it as a function pointer. This solves the above problem of creating two separate loops, while still giving you the desired syntax of wrapping the loop up in a function.
If your `halve(float)` function was a placeholder for something non-trivial, you can use it in a lambda to compose it with something else, e.g. `square(halve(a))`.
In earlier C++ standards, you need to assign the lambda to a function pointer. (*Lambda as function parameter*)
```cpp
// your original function, mostly unchanged, but with size_t and inline
inline  // allows inlining even with -fPIC
void apply_simd(float *x, const float *y, size_t length, float (*simd_func)(float c)) {
    #pragma omp simd
    for (size_t i = 0; i < length; i++)
        x[i] = simd_func(y[i]);
}
```
C++11 caller:
```cpp
// __restrict isn't needed with OpenMP, but you might want to assert
// non-overlapping for better auto-vectorization with non-OpenMP compilers.
void test_lambda(float *__restrict x, const float *__restrict y, size_t length)
{
    float (*funcptr)(float) = [](float a) -> float {
        float h = 0.5f * a;  // halve first allows vmulps with a memory source operand
        return h * h;        // 0.25f * a * a doesn't optimize to that with clang :/
    };
    apply_simd(x, y, length, funcptr);
}
```
In C++17 it's even easier, and Just Works with a literal anonymous lambda:
```cpp
void test_lambda17(float *__restrict x, const float *__restrict y, size_t length)
{
    apply_simd(x, y, length, [](float a) {
        float h = 0.5f * a;
        return h * h;
    });
}
```
These both compile efficiently with gcc and clang, to an inner loop like this (Godbolt compiler explorer).
```asm
.L4:
    vmulps  ymm0, ymm1, YMMWORD PTR [rsi+rax]
    vmulps  ymm0, ymm0, ymm0
    vmovups YMMWORD PTR [rdi+rax], ymm0
    add     rax, 32
    cmp     rax, rcx
    jne     .L4
```
clang unrolls some, and probably comes close to one 256-bit vector loaded+stored per clock, with 2 multiplies. (A non-indexed addressing mode would make that achievable with unrolling to hide the two pointer increments. Silly compiler. :/)
### Lambda or function pointer as a template parameter
With a local lambda as a template param (defined inside the function), the compiler can definitely always inline. But (due to a gcc bug) this isn't currently usable.
But with only a function pointer, it doesn't actually help catch cases where you forgot to use the `inline` keyword or otherwise broke the compiler's ability to inline. It only means the function address must be a dynamic-link-time constant (i.e. not known until runtime binding of dynamic libraries), so it doesn't save you from symbol interposition. At compile time with `-fPIC`, the compiler still doesn't know if the version of a global function it can see will be the one that is actually resolved at link time, or if `LD_PRELOAD` or a symbol in the main executable will override it. So it just emits code that loads a function pointer from the GOT and calls it in a loop. SIMD is of course impossible.
It does stop you from shooting yourself in the foot by passing function pointers around in ways that can't always inline, though. With `constexpr`, you can probably still pass them around as args before using them in a template. So you might want to use this approach, if it weren't for a gcc bug that stops you from using it with lambdas.
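For example (a sketch, assuming C++17; `chosen` and `run` are names I made up), a `constexpr` function pointer can be passed around as an ordinary value and still used as the template argument at the end:

```cpp
#include <cstddef>

inline float my_func(float a) { return 0.25f * a * a; }

// Function non-type template parameter, same shape as apply_template below.
template <float simd_func(float c)>
void apply_template(float *x, const float *y, std::size_t length) {
    for (std::size_t i = 0; i < length; i++)
        x[i] = simd_func(y[i]);
}

// The pointer's value is a compile-time constant, so it's a valid
// template argument: the compiler sees exactly which function is called.
constexpr float (*chosen)(float) = my_func;

inline void run(float *x, const float *y, std::size_t length) {
    apply_template<chosen>(x, y, length);
}
```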
C++17 allows passing an automatic-storage lambda with no captures as a function object. (Previous standards required external or internal (`static`) linkage for functions passed as template params.)
```cpp
template <float simd_func(float c)>
void apply_template(float *x, const float *y, size_t length) {
    #pragma omp simd
    for (size_t i = 0; i < length; i++)
        x[i] = simd_func(y[i]);
}
```
```cpp
void test_lambda(float *__restrict x, const float *__restrict y, size_t length)
{
    // static // even static doesn't help work around the gcc bug
    constexpr auto my_op = [](float a) -> float {
        float h = 0.5f * a;  // halve first allows vmulps with a memory source operand
        return h * h;        // 0.25f * a * a doesn't optimize to that with clang :/
    };
    // unary + converts the captureless lambda to a plain function pointer
    apply_template<+my_op>(x, y, length); // clang accepts this, gcc doesn't
}
```
clang compiles it just fine, but g++ incorrectly rejects it even with `-std=gnu++17`. Unfortunately, gcc has a bug (83258) with using lambdas this way. See *Can I use the result of a C++17 captureless lambda constexpr conversion operator as a function pointer template non-type argument?* for details.
Still, we can use regular functions with the template.
```cpp
// `inline` is still necessary for it to actually inline with -fPIC (in a shared lib)
inline float my_func(float a) { return 0.25f * a * a; }

void test_template(float *__restrict x, const float *__restrict y, size_t length)
{
    apply_template<my_func>(x, y, length); // not actually a lambda, just a function
}
```
Then we get an inner loop like this from `g++ 8.2 -O3 -fopenmp -march=haswell`. Notice that I used `0.25f * a * a;` instead of doing `halve` first, to see what kind of bad code we get. This is what g++ 8.2 does with that:
```asm
.L25:
    vmulps  ymm0, ymm1, YMMWORD PTR [rsi+rax]  # ymm0 = 0.25f * y[i+0..7]
    vmulps  ymm0, ymm0, YMMWORD PTR [rsi+rax]  # reload the same vector again
    vmovups YMMWORD PTR [rdi+rax], ymm0        # store to x[i+0..7]
    add     rax, 32
    cmp     rax, rcx
    jne     .L25
```
Reloading the same vector twice to save instructions might be a good idea if gcc weren't using an indexed addressing mode, which stops it from micro-fusing on Haswell/Skylake. So this loop actually issues as 7 uops, running at best 7/4 cycles per iteration.
With unrolling, approaching the 2 read + 1 write per clock limit for wide vectors is apparently a problem for sustained operation, according to Intel's optimization manual. (They say Skylake might sustain 82 bytes per clock, vs. the peak of 96 loaded + stored in one clock.) It's especially unwise if the data isn't known to be aligned, and gcc8 has switched to an optimistic strategy for unknown-alignment data: use unaligned loads/stores and let the hardware handle cases when there isn't 32-byte alignment. gcc7 and earlier aligns the pointer before the main loop, and loads the vector only once.
Footnote 1: Fortunately gcc and clang can optimize `x / 2.` into `x * 0.5f`, avoiding promotion to `double`.
Using multiply instead of divide is possible without `-ffast-math` because `0.5f` is exactly representable as a `float`, unlike fractions where the denominator isn't a power of 2.
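A quick sanity check of the equivalence (a sketch; `div_vs_mul_equal` is a name I made up). Scaling by a power of two only adjusts the exponent, and IEEE-754 requires both division and multiplication to be correctly rounded, so the two forms produce bit-identical results:

```cpp
// x / 2.0f and x * 0.5f compute the same exact mathematical value,
// and both are correctly rounded, so the results always compare equal.
inline bool div_vs_mul_equal(float x) {
    return (x / 2.0f) == (x * 0.5f);
}
```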
But note that `0.5 * x` does not optimize to `0.5f * x`; gcc and clang really do expand to `double` and back. I'm not sure if that's a missed optimization vs. `x / 2.`, or if there's a real semantic difference that stops it from optimizing a double constant to a float when it can be represented exactly as a `float`.