C++ Centralizing SIMD usage

Question

i have a library and a lot of projects depending on that library. I want to optimize certain procedures inside the library using SIMD extensions. However it is important for me to stay portable, so to the user it should be quite abstract. I say at the beginning that i dont want to use some other great library that does the trick. I actually want to understand if that what i want is possible and to what extent.

My very first idea was to have a "vector" wrapper class, that the usage of SIMD is transparent to the user and a "scalar" vector class could be used in case no SIMD extension is available on the target machine. The naive thought came to my mind to use the preprocessor to select one vector class out of many depending on which target the library is compiled. So one scalar vector class, one with SSE (something like this basically: http://fastcpp.blogspot.de/2011/12/simple-vector3-class-with-sse-support.html) and so on... all with the same interface. This gives me good performance but this would mean that i would have to compile the library for any kind of SIMD ISA that i use. I rather would like to evaluate the processor capabilities dynamically at runtime and select the "best" implementation available.

So my second guess was to have a general "vector" class with abstract methods. The "processor evaluator" function would than return instances of the optimal implementation. Obviously this would lead to ugly code, but the pointer to the vector object could be stored in a smart pointer-like container that just delegates the calls to the vector object. Actually I would prefer this method because of its abstraction but I'm not sure if calling the virtual methods actually will kill the performance that i gain using SIMD extensions.

The last option that i figured out would be to do optimizations whole routines and select at runtime the optimal one. I dont like this idea so much because this forces me to implement whole functions multiple times. I would prefer to do this once, using my idea of the vector class i would like to do something like this for example:

void Memcopy(void *dst, void *src, size_t size)
{
    vector v;
    for(int i = 0; i < size; i += v.size())
    {
        v.load(src);
        v.store(dst);
        dst += v.size();
        src += v.size();
    }
}

I assume here that "size" is a correct value so that no overlapping happens. This example should just show what i would prefer to have. The size-method of the vector object would for example just return 4 in case SSE is used and 1 in case the scalar version is used. Is there a proper way to implement this using only runtime information without loosing too much performance? Abstraction is to me more important than performance but as this is a performance optimization i wouldn't include it if would not speedup my application.

I also found this on the web: http://compeng.uni-frankfurt.de/?vc Its open source but i dont understand how the correct vector class is chosen.

Why are you so concerned about portability? What platforms are you reasonably going to be compiling on? Also what "certain procedures" are you optimizing? It's probable there's a library out there that does exactly what you want, or gets you 50% of the way there. See also [this question about SIMD](http://stackoverflow.com/questions/981787/good-portable-simd-library). — tadman, Oct 01 '15 at 18:49
Select a class at compile time, compile for several targets as a dll/shsred lib, then select and load the right version at runtime. — n. m. could be an AI, Oct 01 '15 at 19:12
The Intel compiler apparently implements this, transparently. It checks the features of your (Intel) CPU and dynamically switches implementations. — MSalters, Oct 02 '15 at 08:52

Peter Cordes · Answer 1 · 2015-10-01T19:05:11.153

5

Your idea will only compile to efficient code if everything inlines at compile time, which is incompatible with runtime CPU dispatching. For v.load(), v.store(), and v.size() to actually be different at runtime depending on the CPU, they'd have to be actual function calls, not single instructions. The overhead would be killer.

If your library has functions that are big enough to work without being inlined, then function pointers are great for dispatching based on runtime CPU detection. (e.g. make multiple versions of memcpy, and pay the overhead of runtime detection once per call, not twice per loop iteration.)

This shouldn't be visible in your library's external API/ABI, unless your functions are mostly so short that the overhead of an extra (direct) call/ret matters. In the implementation of your library functions, put each sub-task that you want to make a CPU-specific version of into a helper function. Call those helper functions through function pointers.

Start with your function pointers initialized to versions that will work on your baseline target. e.g. SSE2 for x86-64, scalar or SSE2 for legacy 32bit x86 (depending on whether you care about Athlon XP and Pentium III), and probably scalar for non-x86 architectures. In a constructor or library init function, do a CPUID and update the function pointers to the best version for the host CPU. Even if your absolute baseline is scalar, you could make your "good performance" baseline something like SSSE3, and not spend much/any time on SSE2-only routines. Even if you're mostly targetting SSSE3, some of your routines will probably end up only requiring SSE2, so you might as well mark them as such and let the dispatcher use them on CPUs that only do SSE2.

Updating the function pointers shouldn't even require any locking. Any calls that happen from other threads before your constructor is done setting function pointers may get the baseline version, but that's fine. Storing a pointer to an aligned address is atomic on x86. If it's not atomic on any platform where you have a version of a routine that needs runtime CPU detection, use C++ std:atomic (with memory-order relaxed stores and loads, not the default sequential consistency which would trigger a full memory barrier on every load). It matters a lot that there's minimal overhead when calling through the function pointers, and it doesn't matter what order different threads see the changes to the function pointers. They're write-once.

x264 (the heavily-optimized open source h.264 video encoder) uses this technique extensively, with arrays of function pointers. See x264_mc_init_mmx(), for example. (That function handles all CPU dispatching for Motion Compensation functions, from MMX to AVX2). I assume libx264 does the CPU dispatching in the "encoder init" function. If you don't have a function that users of your library are required to call, then you should look into some kind of mechanism for running global constructor / init functions when programs using your library start up.

If you want this to work with very C++ey code (C++ish? Is that a word?) i.e. templated classes & functions, the program using the library will probably have do the CPU dispatching, and arrange to get baseline and multiple CPU-requirement versions of functions compiled.

edited Oct 01 '15 at 19:05

answered Oct 01 '15 at 18:52

Peter Cordes

328,167
45
605
847

Why would you recommend pointers when the binary that results from compiling is intrinsically locked to a particular architecture? One header file can have multiple independent implementations, only one of which is necessary to be compiled and linked in. – tadman Oct 01 '15 at 18:59
@tadman: We're talking about runtime dispatching to make a binary that can run on any machine of the target architecture, but will take advantage of features that aren't baseline for the architecture (e.g. run on any x86 CPU, but run faster on Haswell and newer with AVX2. Running AVX2 instructions on CPUs that don't support them will cause an Illegal Instruction exception (x86 #UD exception -> Unix SIGILL). There *is* an instruction you can run on any x86 to find out which extensions are present, though.). Same idea for ARM NEON, since not all ARM CPUs have a vector unit. – Peter Cordes Oct 01 '15 at 19:07
It's a few cases where you can make a binary that has different SIMD options. ARM is a separate binary than x86 than x64. It's only the x86 case where this is a real hassle, and in that case you might want to produce two dynamically linked libraries and load in the correct one as necessary. – tadman Oct 01 '15 at 19:13
@tadman: The OP already said he'd rather do runtime dispatch and ship one library (for each architecture). It would be possible, but annoying, to make a SSE2 library, an SSSE3 library, an SSE4.1 library, an AVX library, and an AVX2 library. (omitting SSE3 and SSE4.2 as unlikely to be worth making separate versions for). Each of these libraries would need a 32bit-x86 and an x86-64 version. You'd also need two ARM libraries, one scalar and one with NEON. And two versions for every arch where you have accelerated versions of any function. – Peter Cordes Oct 01 '15 at 19:29
Also, yes, of course you need different binaries for different target architectures. Stop bringing up this point, it's too trivial to be worth talking about. Of course you can't run x86 binaries on ARM CPUs. – Peter Cordes Oct 01 '15 at 19:34
All of this sounds like an exercise in futility since I'm extremely skeptical that any of this would perform better than using an off the shelf library. The differences in SSE are important, but irrelevant if the instructions don't benefit the task at hand. I'm just proposing the multiple library solution as one that's cleaner from a runtime perspective even if messier to produce. If that's not an option for whatever reason, that's fine. It's just worth mentioning. Even n.m seems to think this is a better approach. – tadman Oct 01 '15 at 19:34
@tadman The OP's suggestion of memcpy, and the terrible implementation idea, do raise the possibility that there might not be any gains to be had for the OP. However, the question as asked has a useful answer. Video encoder libraries like libx264 are a great example of libraries that do benefit a *lot* from CPU extensions, and this is how they do it. IIRC, ATLAS (a matrix linear algebra library) uses the separate-binaries idea, which makes sense because it's designed to hard-code the host's L1/L2 cache sizes at compile time as well. – Peter Cordes Oct 01 '15 at 19:41
That's why I'm skeptical that function dispatch alone can crack this nut. You may need to apply different compiler flags to certain platform-specific dynamic libraries to get them to work at optimal levels. Although setting up a build that works this way is a hassle and a half, the run-time code is super simple. Detect and load the right library, then carry on as if everything was normal from the start. – tadman Oct 01 '15 at 19:43
@tadman: matmul is as much about optimizing for cache as actual CPU instruction throughput. Even if you did need different compiler options (instead of just different compile-time `#define`s), there's no reason you can't get the differently-compiled versions into the same library for runtime dispatch if you want to do it that way. Runtime dispatch with function pointers if a valid option. I'm not saying it's way better than the alternatives, esp. not for every case. You're talking like it's way more complicated and way worse. – Peter Cordes Oct 01 '15 at 19:52
There's nothing wrong the the multiple-binaries approach, except that it's not totally transparent to users of your library. The user of the library has to do the runtime detection and dynamically load it. I'd be interested in seeing an answer that hides all that from the program using the library. At least on POSIX systems, using `dlopen` runtime dynamic loading instead of normal dynamic linking means you need to get function pointers for the library functions you want to call. You can't just use them directly and have them resolved by the dynamic linker when your program loads. – Peter Cordes Oct 01 '15 at 19:53
I should use an array of function pointers (rather than naming them each individually) but you see it started off with few pointers and then grew to many and I have not been bothered enough yet the code simpler. – Z boson Oct 06 '15 at 10:38

score 1 · Accepted Answer · edited May 23 '17 at 12:29

I do exactly this with a fractal project. It works with vector sizes of 1, 2, 4, 8, and 16 for float and 1, 2, 4, 8 for double. I use a CPU dispatcher at run-time to select the following instructions sets: SSE2, SSE4.1, AVX, AVX+FMA, and AVX512.

The reason I use a vector size of 1 is to test performance. There is already a SIMD library that does all this: Agner Fog's Vector Class Library. He even includes example code for a CPU dispatcher.

The VCL emulates hardware such as AVX on systems that only have SSE (or even AVX512 for SSE). It just implements AVX twice (for four times for AVX512) so in most cases you can just use the largest vector size you want to target.

//#include "vectorclass.h"
void Memcopy(void *dst, void *src, size_t size)
{
    Vec8f v; //eight floats using AVX hardware or AVX emulated with SSE twice.
    for(int i = 0; i < size; i +=v.size())
    {
        v.load(src);
        v.store(dst);
        dst += v.size();
        src += v.size();
    }
}

(however, writing an efficient memcpy is complicating. For large sizes you should consider non temroal stores and on IVB and above use rep movsb instead). Notice that that code is identical to what you asked for except I changed the word vector to Vec8f.

Using the VLC, as CPU dispatcher, templating, and macros you can write your code/kernel so that it looks nearly identical to scalar code without source code duplication for every different instruction set and vector size. It's your binaries which will be bigger not your source code.

I have described CPU dispatchers several times. You can also see some example using templateing and macros for a dispatcher here: alias of a function template

Edit: Here is an example of part of my kernel to calculate the Mandelbrot set for a set of pixels equal to the vector size. At compile time I set TYPE to float, double, or doubledouble and N to 1, 2, 4, 8, or 16. The type doubledouble is described here which I created and added to the VCL. This produces Vector types of Vec1f, Vec4f, Vec8f, Vec16f, Vec1d, Vec2d, Vec4d, Vec8d, doubledouble1, doubledouble2, doubledouble4, doubledouble8.

template<typename TYPE, unsigned N>
static inline intn calc(floatn const &cx, floatn const &cy, floatn const &cut, int32_t maxiter) {
    floatn x = cx, y = cy;
    intn n = 0; 
    for(int32_t i=0; i<maxiter; i++) {
        floatn x2 = square(x), y2 = square(y);
        floatn r2 = x2 + y2;
        booln mask = r2<cut;
        if(!horizontal_or(mask)) break;
        add_mask(n,mask);
        floatn t = x*y; mul2(t);
        x = x2 - y2 + cx;
        y = t + cy;
    }
    return n;
}

So my SIMD code for several several different data types and vector sizes is nearly identical to the scalar code I would use. I have not included the part of my kernel which loops over each super-pixel.

My build file looks something like this

g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -msse2          -Ivectorclass  kernel.cpp -okernel_sse2.o
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -msse4.1        -Ivectorclass  kernel.cpp -okernel_sse41.o
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -mavx           -Ivectorclass  kernel.cpp -okernel_avx.o
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -mavx2 -mfma    -Ivectorclass  kernel.cpp -okernel_avx2.o
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -mavx2 -mfma    -Ivectorclass  kernel_fma.cpp -okernel_fma.o
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -mavx512f -mfma -Ivectorclass  kernel.cpp -okernel_avx512.o
g++ -m64 -Wall -Wextra -std=gnu++11 -O3 -fopenmp -mfpmath=sse -msse2 -Ivectorclass frac.cpp vectorclass/instrset_detect.cpp kernel_sse2.o kernel_sse41.o kernel_avx.o kernel_avx2.o kernel_avx512.o kernel_fma.o -o frac

Then the dispatcher looks something like this

int iset = instrset_detect();
fp_float1  = NULL; 
fp_floatn  = NULL;
fp_double1 = NULL;
fp_doublen = NULL;
fp_doublefloat1  = NULL;
fp_doublefloatn  = NULL;
fp_doubledouble1 = NULL;
fp_doubledoublen = NULL;
fp_float128 = NULL;
fp_floatn_fma = NULL;
fp_doublen_fma = NULL;

if (iset >= 9) {
    fp_float1  = &manddd_AVX512<float,1>;
    fp_floatn  = &manddd_AVX512<float,16>;
    fp_double1 = &manddd_AVX512<double,1>;
    fp_doublen = &manddd_AVX512<double,8>;
    fp_doublefloat1  = &manddd_AVX512<doublefloat,1>;
    fp_doublefloatn  = &manddd_AVX512<doublefloat,16>;
    fp_doubledouble1 = &manddd_AVX512<doubledouble,1>;
    fp_doubledoublen = &manddd_AVX512<doubledouble,8>;
}
else if (iset >= 8) {
    fp_float1  = &manddd_AVX<float,1>;
    fp_floatn  = &manddd_AVX2<float,8>;
    fp_double1 = &manddd_AVX2<double,1>;
    fp_doublen = &manddd_AVX2<double,4>;
    fp_doublefloat1  = &manddd_AVX2<doublefloat,1>;
    fp_doublefloatn  = &manddd_AVX2<doublefloat,8>;
    fp_doubledouble1 = &manddd_AVX2<doubledouble,1>;
    fp_doubledoublen = &manddd_AVX2<doubledouble,4>;
}
....

This sets function pointers to each of the different possible datatype vector combination for the instruction set found at runtime. Then I can call whatever function I'm interested.

score 0 · Answer 3 · answered Oct 02 '15 at 16:27

Thanks Peter Cordes and Z boson. With your both replies I I came to a solution that satisfies me. I chose the Memcopy just as an example just because of everyone knowing it and its beautiful simplicity (but also slowness) when implemented naively in contrast to SIMD optimizations that are often not well readable anymore but of course much faster. I have now two classes (more possible of course) a scalar vector and an SSE vector both with inline methods. To the user i show something like: typedef void(*MEM_COPY_FUNC)(void *, const void *, size_t);

extern MEM_COPY_FUNC memCopyPointer;

I declare my function something like this, as Z boson pointed out: template void MemCopyTemplate(void *pDest, const void *prc, size_t size) { VectorType v; byte *pDst, *pSrc; uint32 mask;

    pDst = (byte *)pDest;
    pSrc = (byte *)prc;

    mask = (2 << v.GetSize()) - 1;
    while(size & mask)
    {
        *pDst++ = *pSrc++;
    }

    while(size)
    {
        v.Load(pSrc);
        v.Store(pDst);

        pDst += v.GetSize();
        pSrc += v.GetSize();
        size -= v.GetSize();
    }
}

And at runtime, when the library is loaded, i use CPUID to do either

memCopyPointer = MemCopyTemplate<ScalarVector>;

or

memCopyPointer = MemCopyTemplate<SSEVector>;

as you both suggested. Thanks a lot.

oh and its while(size & mask) { *pDst++ = *pSrc++; size--; } — acensored, Oct 02 '15 at 16:30

C++ Centralizing SIMD usage

3 Answers3