What is the best way to implement multiple versions of the same function, so that it uses specific CPU instructions if they are available (tested at run time) and falls back to a slower implementation if not?

For example, x86 BMI2 provides a very useful PDEP instruction. How would I write C code that tests for BMI2 availability on the executing CPU at startup and then uses one of two implementations: one that calls _pdep_u64 (available with -mbmi2), and another that does the bit manipulation "by hand" in plain C? Is there any built-in support for such cases? How would I make GCC compile for an older architecture while still providing access to the newer intrinsic? I suspect execution is faster if the function is invoked via a global function pointer rather than through an if/else on every call?

Peter Cordes
Yuri Astrakhan
  • See how an OS kernel does this (hint: for x86 it's a huge number of features based on what the _cpuid_ instruction returns). – 0andriy Apr 03 '20 at 05:50

1 Answer

You can declare a function pointer and, at program startup, point it to the correct version after using cpuid to determine which features the running CPU supports.
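A minimal sketch of that function-pointer approach, assuming GCC or Clang on x86-64. The names `pdep_soft`, `pdep_bmi2`, `pdep` and `pdep_init` are made up for illustration; `__builtin_cpu_init` and `__builtin_cpu_supports` are compiler builtins that wrap cpuid, and the `target("bmi2")` attribute lets one function use the intrinsic without building the whole file with -mbmi2.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__x86_64__)
#include <immintrin.h>
#endif

/* Portable fallback: scatter the low bits of src to the set-bit positions of mask. */
static uint64_t pdep_soft(uint64_t src, uint64_t mask) {
    uint64_t result = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        if (src & bit)
            result |= mask & -mask;  /* lowest set bit of mask */
        mask &= mask - 1;            /* clear that bit */
    }
    return result;
}

#if defined(__x86_64__)
/* The target attribute enables BMI2 for this one function only. */
__attribute__((target("bmi2")))
static uint64_t pdep_bmi2(uint64_t src, uint64_t mask) {
    return _pdep_u64(src, mask);
}
#endif

/* Global function pointer; it starts at the fallback, which is always safe. */
static uint64_t (*pdep)(uint64_t, uint64_t) = pdep_soft;

/* Call once at startup, before the first use of pdep. */
static void pdep_init(void) {
#if defined(__x86_64__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("bmi2"))
        pdep = pdep_bmi2;
#endif
    /* Sanity check: whichever version was chosen must agree with the fallback. */
    assert(pdep(0x5, 0xF0) == pdep_soft(0x5, 0xF0));
}
```

After `pdep_init()` has run, callers simply write `pdep(src, mask)`; each call costs one indirect call and no feature test.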

But it's better to use the support that many modern compilers provide. Intel's ICC has long had automatic function dispatching that selects the optimized version for each architecture. I don't know the details, but it looks like it applies only to Intel's libraries. Besides, it dispatches to the efficient version only on Intel CPUs, which is unfair to other manufacturers. Agner's CPU blog describes many patches and workarounds for that.

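Later, a feature called Function Multiversioning was introduced in GCC 4.8. It adds the target attribute, which you declare on each version of your function. Note that GCC documents this form as a C++ feature, so the example below has to be compiled as C++.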

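// Fallback used when no other version matches the running CPU; GCC requires one.
__attribute__ ((target ("default")))
int foo() { return 0; }

__attribute__ ((target ("sse4.2")))
int foo() { return 1; }

__attribute__ ((target ("arch=atom")))
int foo() { return 2; }

int main() {
    int (*p)() = &foo;   // the pointer resolves to the same dispatched version
    return foo() + p();
}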

That duplicates a lot of code and is cumbersome, so GCC 6 added target_clones, which tells GCC to compile one function into multiple clones. For example, __attribute__((target_clones("avx2","arch=atom","default"))) void foo() {} creates 3 different versions of foo. More information can be found in GCC's documentation on function attributes.
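As a rough illustration of target_clones in plain C, here is a sketch that assumes GCC 6 or later (or a recent Clang) on x86-64 Linux with glibc, since the generated dispatcher relies on ifunc support. The macro `CLONE_FOR_AVX2` and the function `sum_u32` are hypothetical names; on other platforms the macro expands to nothing and only one version is built.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: only enable the attribute where ifunc-based dispatch is known to work. */
#if defined(__GNUC__) && defined(__x86_64__) && defined(__linux__) && defined(__GLIBC__)
#define CLONE_FOR_AVX2 __attribute__((target_clones("avx2", "default")))
#else
#define CLONE_FOR_AVX2
#endif

/* One source body; the compiler emits an AVX2 clone, a baseline clone,
   and a resolver that picks between them for the running CPU. */
CLONE_FOR_AVX2
uint64_t sum_u32(const uint32_t *values, size_t count) {
    assert(values != NULL || count == 0);
    uint64_t total = 0;
    for (size_t i = 0; i < count; i++)
        total += values[i];
    return total;
}
```

Callers just call `sum_u32(values, count)`. This style suits code the compiler can auto-vectorize or lower from builtins; it does not help when the two versions need different source code, as with a `_pdep_u64` intrinsic versus a hand-written fallback.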

The syntax was then adopted by Clang and ICC, although support varies by version (see the comments below: Clang 10 still rejects target_clones). Performance can even be better than with a hand-written global function pointer, because the function symbols are resolved by the dynamic loader when the process loads them rather than by your own code at run time. It's one of the reasons Intel's Clear Linux runs so fast. ICC may also create multiple versions of a single loop during auto-vectorization.
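The load-time resolution mentioned above is built on GNU indirect functions (ifunc), which can also be used by hand. The following is a sketch under the assumption of GCC or Clang on x86-64 Linux with glibc; all function names here are illustrative, and on other platforms the code falls back to a plain wrapper.

```c
#include <assert.h>
#include <stdint.h>

/* Portable fallback: clear the lowest set bit until none remain. */
static int popcount64_soft(uint64_t x) {
    int n = 0;
    for (; x != 0; x &= x - 1)
        n++;
    return n;
}

#if defined(__x86_64__) && defined(__linux__) && defined(__GLIBC__)
/* Compiled with POPCNT enabled regardless of the command-line -m flags. */
__attribute__((target("popcnt")))
static int popcount64_hw(uint64_t x) {
    return __builtin_popcountll(x);
}

/* Resolver: the dynamic loader runs it once and binds popcount64 to the result. */
static int (*resolve_popcount64(void))(uint64_t) {
    __builtin_cpu_init();  /* needed here: resolvers run before constructors */
    return __builtin_cpu_supports("popcnt") ? popcount64_hw : popcount64_soft;
}

int popcount64(uint64_t x) __attribute__((ifunc("resolve_popcount64")));
#else
/* No ifunc support assumed: always use the fallback. */
int popcount64(uint64_t x) { return popcount64_soft(x); }
#endif

/* Optional self-check that both paths agree on a sample value. */
void popcount64_selfcheck(void) {
    assert(popcount64(0xF0F0) == popcount64_soft(0xF0F0));
}
```

Compared with a global function pointer, there is no explicit init call to forget: `popcount64` is an ordinary symbol from the caller's point of view.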

Here's an example from the article "The one with multi-versioning (Part II)", along with its demo. It is about popcnt rather than PDEP, but the idea is the same.

__attribute__((target_clones("popcnt","default")))
int runPopcount64_builtin_multiarch_loop(const uint8_t* bitfield, int64_t size, int repeat) {
    int res = 0;
    const uint64_t* data = (const uint64_t*)bitfield;

    for (int r=0; r<repeat; r++)
    for (int i=0; i<size/8; i++) {
        res += popcount64_builtin_multiarch_loop(data[i]);
    }

    return res;
}

Note that PDEP and PEXT are very slow on AMD CPUs before Zen 3 (the current ones at the time of writing), where they are microcoded, so they should only be enabled on Intel or on newer AMD parts.
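One way to act on that note at run time is to check the CPU vendor as well as the feature bit. This is only a sketch: `use_hw_pdep` is a hypothetical helper, it assumes the GCC/Clang builtins `__builtin_cpu_supports` and `__builtin_cpu_is`, and the Intel-only policy is deliberately conservative.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true only when BMI2 is present and the CPU vendor is Intel. */
static bool use_hw_pdep(void) {
#if defined(__x86_64__) && defined(__GNUC__)
    __builtin_cpu_init();
    return __builtin_cpu_supports("bmi2") && __builtin_cpu_is("intel");
#else
    return false;  /* unknown platform: stay on the software fallback */
#endif
}
```

A startup routine or resolver could call `use_hw_pdep()` instead of testing the feature bit alone when choosing between the intrinsic and the fallback.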

phuclv
  • `popcount64_builtin_multiarch_loop` is a really weird name for a wrapper around `__builtin_popcountll`, because it's just a wrapper, not containing a loop. There's also no need for the wrapper function, it still compiles the same with just `__builtin_popcountll` inside the loop directly. (You might want to wrap it for portability to non-GNU compilers or whatever for other use-cases, e.g. using `std::bitset<64>::count` . But that's superfluous to this example that already only works in GNU C / C++, i.e. with gcc or clang. And compiles with ICC but doesn't make a version with `popcnt`.) – Peter Cordes Apr 03 '20 at 16:12
  • I have no idea why it was wrapped like that either. Probably the author thought that it's easier to port to other compilers? – phuclv Apr 03 '20 at 16:20
  • ICC: https://godbolt.org/z/d9BoHF. Oh, and clang doesn't multiversion either, but it does auto-vectorize with SSE2. Probably not profitably compared to popcnt, but maybe compared to its scalar fallback bithack. Anyway, clang10.0 says *unknown attribute 'target_clones'*, so not yet for clang with that compact syntax. – Peter Cordes Apr 03 '20 at 16:22
  • In the first example, what happens if I just call `foo()` on a platform that does not match either of the attributes? – Yuri Astrakhan Apr 03 '20 at 17:00
  • @Yurik I haven't checked that but typically you'll have to provide a "default" version – phuclv Apr 03 '20 at 17:26
  • @Yurik I think the dispatcher will search down the version list and select the one that has the highest priority – phuclv Apr 03 '20 at 17:35