I'm in the process of planning a vector math library. It should be based on expression templates to express chained math operations on vectors, e.g.
```cpp
Vector b = foo (bar (a));
```

where

- `a` is a source vector
- `bar` returns an instance of an expression template that supplies functions that transform each element of `a`
- `foo` returns an instance of an expression template that supplies functions that transform each element of the expression fed into it (`bar (a)`)
- `Vector::operator=` actually invokes the expression returned by `foo` in a for-loop and stores it to `b`
So far, this is the basic textbook example of an expression template.
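For context, here is roughly the minimal scalar-only version I have in mind (all names, `UnaryExpr`, `foo`, `bar`, are placeholders, not a final API):

```cpp
#include <cstddef>
#include <vector>

// Expression node: applies Op to each element of the wrapped source.
// Note: nodes hold references, so an expression must be consumed
// within the statement that builds it.
template <class Src, class Op>
struct UnaryExpr
{
    const Src& src;
    Op op;

    auto operator[] (std::size_t i) const { return op (src[i]); }
    std::size_t size() const              { return src.size(); }
};

// Deduction guide so UnaryExpr { s, lambda } deduces its arguments.
template <class Src, class Op>
UnaryExpr (const Src&, Op) -> UnaryExpr<Src, Op>;

template <class T>
class Vector
{
public:
    explicit Vector (std::size_t n) : mem (n) {}

    T operator[] (std::size_t i) const { return mem[i]; }
    std::size_t size() const           { return mem.size(); }

    // Assigning an expression runs one fused per-element loop.
    template <class Expression>
    Vector& operator= (const Expression& e)
    {
        for (std::size_t i = 0; i < size(); ++i)
            mem[i] = e[i];
        return *this;
    }

private:
    std::vector<T> mem;
};

// Element-wise operations that return expression templates, not results.
template <class Src>
auto bar (const Src& s) { return UnaryExpr { s, [] (auto x) { return x * x; } }; }

template <class Src>
auto foo (const Src& s) { return UnaryExpr { s, [] (auto x) { return x + 1; } }; }
```

With this, `b = foo (bar (a));` compiles down to a single loop computing `mem[i] = a[i] * a[i] + 1`.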
However, the plan is that the templates should not only supply the implementation for a per-element transformation in an `operator[]` fashion, but also the implementation for a vectorised version of it. Ideally, the template returned by `foo` supplies a fully inlined function that loads N consecutive elements of `a` into a SIMD register, executes the chained operations of `bar` and `foo` using SIMD instructions, and returns a SIMD register. This should not be too hard to implement either.
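To illustrate what I mean, a rough sketch for `float` with raw SSE intrinsics (again, made-up names; in the real library a `SIMDRegister` wrapper would stand in for `__m128`):

```cpp
#include <immintrin.h> // SSE intrinsics
#include <cstddef>

// Leaf: a raw float source that can hand out 4 consecutive elements at once.
struct FloatSpan
{
    const float* data;

    float operator[] (std::size_t i) const { return data[i]; }
    __m128 eval128 (std::size_t i) const   { return _mm_loadu_ps (data + i); }
};

// Node: adds a constant to every element of the wrapped expression.
template <class Src>
struct AddConstExpr
{
    const Src& src;
    float c;

    float operator[] (std::size_t i) const { return src[i] + c; }

    // Vectorised path: pull 4 elements from the inner expression, transform
    // them, and return the SIMD register so the caller can keep chaining.
    __m128 eval128 (std::size_t i) const
    {
        return _mm_add_ps (src.eval128 (i), _mm_set1_ps (c));
    }
};

template <class Src>
auto addConst (const Src& s, float c) { return AddConstExpr<Src> { s, c }; }
```

Nesting such nodes gives the compiler a fully visible, inlinable chain of SIMD operations per block of 4 elements.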
Now, on x86_64 CPUs, I'd love to optionally use AVX if available, but the implementation should invoke the AVX-based code path only if the CPU we are currently running on supports AVX. Otherwise an SSE-based implementation should be used, or a non-SIMD fallback as a last resort. Usually in cases like this, the various implementations would be put into different translation units, each compiled with the appropriate vector instruction flags, and the runtime code would contain some dispatching logic, as sketched below.
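For non-templated kernels, that classic pattern looks roughly like this (sketch with made-up kernel names; `__builtin_cpu_supports` is GCC/Clang-only, MSVC would need a `__cpuid`-based check instead):

```cpp
// kernel_avx.cpp    -- compiled with -mavx  (or /arch:AVX)
// kernel_sse.cpp    -- compiled with -msse2
// kernel_scalar.cpp -- compiled with baseline flags

// dispatch.cpp -- compiled with baseline flags
#include <cstddef>

void addKernel_avx    (float* dst, const float* a, const float* b, std::size_t n);
void addKernel_sse    (float* dst, const float* a, const float* b, std::size_t n);
void addKernel_scalar (float* dst, const float* a, const float* b, std::size_t n);

void addKernel (float* dst, const float* a, const float* b, std::size_t n)
{
#if defined(__clang__) || defined(__GNUC__)
    if (__builtin_cpu_supports ("avx"))
        return addKernel_avx (dst, a, b, n);
    if (__builtin_cpu_supports ("sse2"))
        return addKernel_sse (dst, a, b, n);
#endif
    addKernel_scalar (dst, a, b, n);
}
```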
However, with the expression template approach, the chain of instructions to be executed is defined in the application code, which invokes some chained expression templates that translate to a very specific template instantiation. I explicitly want the compiler to be able to fully inline the chain of expressions to gain maximum performance, so the whole implementation should be header-only. But compiling all of the user code with e.g. an AVX flag would be a bad idea, since the compiler would probably emit AVX instructions in other parts of the code that might not be supported at runtime.
So, is there any clever solution for this? Basically, I'm looking for a way to force the compiler to generate certain SIMD instructions only within e.g. a function template scope, as in this piece of pseudocode (I know that the code in `Vector::operator=` does not reflect alignment etc., and you have to imagine that I have some fancy C++ wrapper for the native SIMD registers; I just want to point out the core question here):
```cpp
template <class T, class Src>
struct MyFancyExpressionTemplate
{
    // Compile this function with SSE
    forcedinline SIMDRegister<T, 128> eval128 (size_t i)
    {
        // SSE implementation here
    }

    // Compile this function with AVX
    forcedinline SIMDRegister<T, 256> eval256 (size_t i)
    {
        // AVX implementation here
    }
};

template <class T>
class Vector
{
public:
    // A whole lot of functions

    template <class Expression>
    requires isExpressionTemplate<Expression>
    void operator= (Expression&& e)
    {
        const auto n = size();

        if (avxIsAvailable)
        {
            // Compile this part with AVX
            for (size_t i = 0; i < n; i += SIMDRegister<T, 256>::numElements)
                e.eval256 (i).store (mem + i);
            return;
        }

        if (sseIsAvailable)
        {
            // Compile this part with SSE
            for (size_t i = 0; i < n; i += SIMDRegister<T, 128>::numElements)
                e.eval128 (i).store (mem + i);
            return;
        }

        // Scalar fallback
        for (size_t i = 0; i < n; ++i)
            mem[i] = e[i];
    }

    // A whole lot of further functions
};
```
To my knowledge this is not possible, but I might be overlooking some fancy `#pragma` or some trick to reorganize my code to make it work. Any idea for a completely different approach that still gives the compiler room to inline the whole chain of SIMD operations would be greatly appreciated.
We are targeting (Apple) Clang 13+ and MSVC 2022, but are thinking of switching to Clang for Windows as well. We use C++20.
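One partial avenue I've looked at is the GCC/Clang per-function target attribute (there is also a region form, `#pragma GCC target`), which enables an instruction set for a single function without changing the flags for the rest of the translation unit. A rough sketch with a made-up standalone kernel, not my actual expression code:

```cpp
// Clang/GCC only: the rest of the TU keeps the baseline flags,
// only this function is allowed to use AVX instructions.
#include <immintrin.h>
#include <cstddef>

__attribute__((target("avx")))
void addConst_avx (float* dst, const float* src, float c, std::size_t n)
{
    const __m256 k = _mm256_set1_ps (c);
    for (std::size_t i = 0; i + 8 <= n; i += 8)
        _mm256_storeu_ps (dst + i, _mm256_add_ps (_mm256_loadu_ps (src + i), k));
}
```

As far as I understand, functions sharing the same target attribute can still inline into each other, so the SIMD chain itself could stay fully inlined; but they cannot be inlined into a caller compiled for a weaker target, so the dispatch boundary remains a real call. And MSVC has no equivalent attribute at all (although MSVC does accept AVX intrinsics without `/arch:AVX`), so I'm not sure this is the full answer.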