Dynamic dispatching of different SIMD implementations in header-only code. Possible at all?

Question

I'm in the process of planing a vector math library. It should work based on expression templates to express chained math operations on vectors, like e.g.

Vector b = foo (bar (a));

where

a is a source vector
bar returns an instance of an expression template that supplies functions transform each element of a
foo returns an instance of an expression template that supplies functions that transform each element of the expression fed into it (bar (a))
Vector::operator= actually invokes the expression returned by foo in a for-loop and stores it to b

Well, the basic example of an expression template so far.

However, the plan is that the templates should not only supply the implementation for an per-element transformation in an operator[] fashion, but should also supply the implementation for a vectorised version of that. At best, the template returned by foo supplies a fully inlined function that loads N consecutive elements of a into a SIMD register and executes the chained operations of bar and foo using SIMD instructions and returns a SIMD register as return value.

This should also be not too hard to implement so far.

Now, on x86_64 CPUs, I'd love to optionally use AVX if available, but I want the implementation to invoke the AVX based implementation only if the CPU we currently run on supports AVX. Otherwise an SSE based implementation should be used, or a non-SIMD fallback as last resort. Usually in cases like this, the various implementations would be put into different translation units, each being compiled with the according vector instruction flags and then the runtime code would contain some dispatching logic.

However with the expression template approach, the chain of instructions to be executed will be defined in the application code, that actually invokes some chained expression templates which translate to a very specific template instantiation. I explicitly want the compiler to be able to fully inline the chain of expressions to gain maximum performance, therefore all the implementation should be header-only. But compiling the whole user code with e.g. an AVX flag would be a bad idea, since the compiler would probably insert some AVX instructions in other parts of the code, that might not be supported at runtime.

So, is there any clever solution for that? Basically, I'm looking for a way to force the compiler to only generate some SIMD instructions for e.g. a function template scope, like in this piece of pseudocode (I know that the code in Vector::operator= does not reflect alignment etc. and you have to imagine that I have some fancy C++ wrapper for the native SIMD registers – just want to point out the core question here)

template <class T, class Src>
struct MyFancyExpressionTemplate
{

    // Compile this function with AVX
    forcedinline SIMDRegister<T, 128> eval128 (size_t i)
    {
        // SSE implementation here
    }

    // compile this function with AVX
    forcedinline SIMDRegister<T, 256> eval256 (size_t i)
    {
        // AVX implementation here
    }

}

template <class T>
class Vector
{
public:
    // A whole lot of functions
 
    template <class Expression>
    requires isExpressionTemplate<Expression>
    void operator= (Expression&& e)
    {
        const auto n = size();

        if (avxIsAvailable)
        {
            // Compile this part with AVX
            for (size_t i = 0; i < n; i+= SIMDRegister<T, 256>::numElements)
                e.eval256 (i).store (mem + i);
            return;
        }

        if (sseIsAvailable)
        {
            // Compile this part with SSE
            for (size_t i = 0; i < n; i+= SIMDRegister<T, 128>::numElements)
                e.eval128 (i).store (mem + i);
            return;
        }

        for (size_t i = 0; i < size(); ++i)
            mem[i] = e[i];
    }
    
    // A whole lot of further functions
}

To my knowledge this is not possible, but I might be overlooking some fancy #pragma or some trick to re-organize my code to make that work. Any idea for completely different approaches to the problem, that supports the goal of giving the compiler room to inline the whole chain of SIMD operations is greatly appreciated.

We are targeting (Apple) Clang 13+ and MSVC 2022, but thinking of switching to Clang for Windows as well. We use C++ 20.

According to steam survey, only 4.19% of their users don’t have AVX1 https://store.steampowered.com/hwsurvey Consider dropping support of these 10+ years old CPUs, might save a lot of complexity in your code. — Soonts, Aug 17 '22 at 19:28
Maybe [`__attribute__((target(...)))`](https://clang.llvm.org/docs/AttributeReference.html#target) ("[multiversioning](https://gcc.gnu.org/onlinedocs/gcc-9.1.0/gcc/Function-Multiversioning.html)") can help? It is supported by gcc and clang, but apparently there is [no analogue for MSVC](https://stackoverflow.com/q/34714146/3740047). — Sedenion, Aug 17 '22 at 20:07
@Soonts: CPUs as recent as Skylake-Pentium are lacking AVX1, unfortunately. One of the most recent Intel CPU without AVX is [Pentium Gold G6405](https://ark.intel.com/content/www/us/en/ark/products/201901/intel-pentium-gold-g6405-processor-4m-cache-4-10-ghz.html) 2c4t Comet Lake launched in Q1 2021, or the mobile version from Q4'2019. And of course there are "netbook" laptops with Atom CPUs, Silvermont-family before Gracemont doesn't have AVX, but people don't expect those to be game-capable. See [Which mobile windows devices don't support AVX2](https://stackoverflow.com/q/73134994) — Peter Cordes, Aug 17 '22 at 21:16
@Soonts: So yeah, there aren't a lot of CPUs without AVX, but unfortunately thanks to Intel they're not all ancient. — Peter Cordes, Aug 17 '22 at 21:17
@PeterCordes I agree but still, software development is an engineering discipline, it’s all about balancing tradeoffs. Ignoring AVX costs non-trivial performance penalty for 95% of users. Supporting both SSE and AVX increases software development costs. Dropping SSE costs 5% of potential users of the computers older than 10 years, or computers with these inexpensive Pentium’s. Pretty sure for different software, each of these 3 tactics might be the best one. — Soonts, Aug 17 '22 at 21:55
@Sedenion wow, this seems to be exactly what I was looking for. Just set up a quick example on Compiler Explorer to verify that it also works in some templated code. Interesting finding: While MSVC doesn't know the attributes, it seems to compile the mixed code fine, even without specifying explicit architecture flags. Not sure if there is any downside with this approach (note the stupid macro that just resolves to nothing when compiling with MSVC) https://godbolt.org/z/EKjnToazT — PluginPenguin, Aug 17 '22 at 22:01
Thanks @Soonts for bringing that up. I assume that the steam survey was focused on steam users, which might or might not be representative for our users. In the end I'm not in the position to decide dropping SSE support, it's a point worth discussing with our product management team, but I'm sure that we won't drop SSE support at this time. — PluginPenguin, Aug 17 '22 at 22:06
@Soonts: Yeah, you still might decide to require AVX, but people need to be aware that there are some nearly-new PCs out there without it. For people that have no idea about ISA extensions, they're likely to blame the developer for making software that won't work, not Intel for selling them a crippled CPU. IDK if it's likely for reports of software not working to scare off other users (for whom it would work fine); I assume it would depend on what kind of target market (if commercial) the software is aimed at. That's a hypothetical worst-case, but you need to be aware of recent non-AVX CPUs — Peter Cordes, Aug 17 '22 at 22:16
@PluginPenguin It’s their job to make decisions like that, but it’s your job to tell them about these tradeoffs. I have no idea about your users. For some CAM/CAE app I’m working on, after hearing these tradeoffs from me, the management decided to list AVX2 CPU as a requirement. AFAIK just a single person was affected over 2 years which followed, trying to evaluate the software on a workstation with an Ivy Bridge Xeon. — Soonts, Aug 17 '22 at 22:20
@Soonts: Another notable non-AVX "CPU" is Rosetta. Obviously not your first choice for performance code, but suppose you have a closed-source x86 binary linked against OP's library, and you want to run it on your Apple silicon... — Nate Eldredge, Aug 25 '22 at 06:12
Can't this be done easily (or maybe not so easily, depending) via your _build system_? Include different SIMD headers for different architectures (determined by build system settings, the same ones that select the architecture flag for the compiler)? Either via file name or via include path? (P.S. remember you don't have to _define_ methods - not even methods of a _template_ class_ - inline in the class decl.) — davidbak, Aug 25 '22 at 19:59

score 2 · Accepted Answer · answered Aug 25 '22 at 19:51

To have a proper answer, as I already noted in the comments, on clang (and gcc) you can use multiversioning to achieve it, i.e. function attributes such as __attribute__((target("AVX"))). The attribute is also supported by clang. So it can look like this:

template <class T, class Src>
struct MyFancyExpressionTemplate
{
    // Compile this function with SSE
    SIMDRegister<T, 128> eval128 (size_t i) __attribute__((target("SSE2")))
    {
        // SSE implementation here
    }

    // compile this function with AVX
    SIMDRegister<T, 256> eval256 (size_t i) __attribute__((target("AVX")))
    {
        // AVX implementation here
    }
}

But note that this will prevent the compiler from inlining the functions into other functions that do not have the same target options. So the functions containing the loops that call these functions should probably also have these targets set.

On MSVC, something similar is neither available nor required. The /arch compiler flag just tells the compiler to use the instruction set everywhere. If you do not enable AVX and then use the corresponding intrinsics, you are responsible to ensure that the machine actually supports AVX when it attempts to execute the code.

There are other posts on stackoverflow with more information, so I suggest you have a look at those. For example this, this or this post.

Dynamic dispatching of different SIMD implementations in header-only code. Possible at all?

1 Answers1