3

I have written some AVX2 code to run on a Haswell i7 processor. The same codebase is also used on non-Haswell processors, where the AVX2 code should be replaced with its SSE equivalent. I was wondering whether there is a way to make the compiler ignore the AVX2 instructions on non-Haswell processors. I need something like:

public void useSSEorAVX(...) {
    IF (compiler directive detected AVX2)
        AVX2 code (this part is ready)
    ELSE
        SSE code (this part is also ready)
    ENDIF
}

Right now I am commenting out related code before compiling but there must be some more efficient way to do this. I am using Ubuntu and gcc. Thanks for your help.

Alexandros
  • 2,160
  • 4
  • 27
  • 52
  • by "function" do you mean "cpu instruction"? – PlasmaHH May 15 '14 at 11:06
  • If you are using gcc you might be interested in the target attribute. – PlasmaHH May 15 '14 at 11:15
  • You possibly severely overestimate the smarts of the compiler, running on *your* machine, to guess what the user's machine looks like. It can of course never be a "compiler directive". It has to be a runtime test, your CRT will wrap the CPUID instruction that tells you what the processor really looks like. You left no breadcrumbs, the specific CRT you use matters. – Hans Passant May 15 '14 at 11:17

2 Answers

18

I don't think it's a good idea to make separate executables unless you have to. In your case you can make a CPU dispatcher instead. I did this recently for GCC and Visual Studio.

Let's assume you have a function called product with an SSE and an AVX2 version. You put the SSE version in a file product_SSE.cpp and the AVX2 version in a file product_AVX2.cpp. You compile each one separately (e.g. with -msse2 and -mavx2). Then make a module like this:

// instrset_detect() comes from Agner Fog's Vector Class Library (instrset.h);
// it returns 8 or higher on CPUs that support AVX2.
extern "C" void product_SSE(float *a, float *b, float *c, int n);
extern "C" void product_AVX2(float *a, float *b, float *c, int n);
           void product_dispatch(float *a, float *b, float *c, int n);
void (*fp)(float *a, float *b, float *c, int n) = product_dispatch;

inline void product_dispatch(float *a, float *b, float *c, int n) {
    int iset = instrset_detect();
    if(iset >= 8) {          // AVX2 available
        fp = product_AVX2;
    }
    else {
        fp = product_SSE;
    }
    fp(a,b,c,n);
}

inline void product(float *a, float *b, float *c, int n) {
    fp(a,b,c,n);
}

You compile that module with the lowest common instruction set (e.g. with -msse2). Now when you call product it first calls product_dispatch, which sets the function pointer fp to either product_AVX2 or product_SSE and then calls the function through the pointer. From the second call on, product jumps right to product_AVX2 or product_SSE. This way you don't need separate executables.

Z boson
  • 32,619
  • 11
  • 123
  • 226
  • 1
    +1: nice idea, but a bit of a pain if you have more than a few SIMD functions - maybe it could be extended to a single function pointer *table* to reduce the amount of duplicated boilerplate code ? – Paul R May 15 '14 at 12:17
  • 2
    Compilation and running will always be on the same machine. So, there is no need to cross-copy binaries. Also your solution assumes that compilation will be done on the AVX2 machine (otherwise the AVX file will not compile). But +1 anyway for teaching me new things – Alexandros May 15 '14 at 12:18
  • @PaulR, one of you two forgot to give me my +1 :-) I agree it would be a pain for many SIMD functions (I'm only using it for two: a float and a double version) right now. A table would probably be a good idea in general. – Z boson May 15 '14 at 12:23
  • I think Intel does something like this in its optimised libraries (IPP, MKL, etc). – Paul R May 15 '14 at 12:24
  • 1
    @PaulR, yeah I think you're right. I got the idea from Agner Fog. It's in his file dispatch_example.cpp in his Vector Class Library. I had some trouble getting it working in Visual Studio originally but it works fine now. He has 10 pages on writing dispatchers in his Optimizing C++ manual. Intel's dispatcher looks for the Intel ID and points the function to a suboptimal one if it finds a non-Intel processor. So it's better to write your own dispatcher. – Z boson May 15 '14 at 12:29
  • 1
    we do not forget to give you your +1. I think you forgot to give @PaulR your +1 for his excellent answer :-) – Alexandros May 15 '14 at 13:10
  • @Alexandros, don't worry, I have upvoted PaulR probably more times than anyone else on SO. – Z boson May 16 '14 at 12:55
  • 2
    x264 uses tables of function pointers for every routine that has an asm version available for some instruction set. `if (cpu_has_sse3) { block of code setting pointers to all available sse3 routines; } if (cpu_has_avx) { set pointers to all available avx routines; }`. So you get the best available version of any routine. – Peter Cordes May 01 '15 at 06:08
  • However, indirect calls through function pointers have slightly more overhead than direct calls (they depend more on the branch target buffer to avoid bubbles in the pipeline due to stalled instruction fetch on BTB miss). Using this technique for something like memcpy that's called from all over the place would waste a lot of BTB entries, making branch prediction worse in general. – Peter Cordes May 01 '15 at 06:14
  • For a routine called from all over the place, you *could* modify the CODE instead of the pointers, once at app startup. So all the function calls to your CPU-optimized custom memcpy would be to an address in writeable memory that is also executable. (same setup you'd need for JIT and other on-the-fly code-gen). At startup, your program would overwrite the generic custom_memcpy implementation with one for the detected CPU, then mark that memory page read-only. Leave enough padding for the longest. (detecting at startup means you'll have problems in a VM that migrates to a different CPU.) – Peter Cordes May 01 '15 at 06:19
  • @PeterCordes, I have not done something like this before. Are you talking about self-modifying code? – Z boson May 04 '15 at 08:48
  • Yes. But only once, at startup. There'd be some linker trickery to arrange for your code to compile with all the calls to this function actually pointing at a buffer in writeable memory that you were going to copy the CPU-specific code into. IDK how one would actually go about that. I'd probably just use the system `memcpy` :P And maybe call (through a function pointer if needed) a custom-tuned copy routine from a hotspot, calling a routine that was tuned for the size of copy that hotspot used. – Peter Cordes May 04 '15 at 17:10
  • @PeterCordes, I like your idea, let me think about it. Thank you. – Z boson May 05 '15 at 07:46
  • @Zboson can you explain why the `extern "C"` is needed. – cyrusbehr Dec 04 '20 at 00:17
  • @cyrusbehr I don't know if it is. Try it without it. It uses C function names instead of C++ function names, which are mangled and less convenient for debugging. – Z boson Dec 04 '20 at 09:08
5

If you just want to do this at compile-time then you can do this:

#ifdef __AVX2__
    // AVX2 code
#elif defined(__SSE__)
    // SSE code
#else
    // scalar code
#endif

Note that when you compile with gcc -mavx2 ... then __AVX2__ gets defined automatically. Similarly for __SSE__. (Note also that you can check what's pre-defined by your compiler for any given set of command-line switches using the incantation gcc -dM -E -mavx2 - < /dev/null.)

If you want to do run-time dispatching though then that's a little more complicated.

Paul R
  • 208,748
  • 37
  • 389
  • 560
  • 1
    +1 Right now, I have -march=native and -mtune=native on my makefile but adding the -mavx2 on the makefile used on the Haswell processor would be no problem. Once tested I will accept your answer – Alexandros May 15 '14 at 11:30
  • 1
    You may not need to explicitly add the `-mavx2` switch - check whether `-march=native -mtune=native` already does this implicitly for you using e.g. `gcc -dM -E -march=native -mtune=native - < /dev/null | grep AVX`. – Paul R May 15 '14 at 11:39
  • 1
    Yes it does. So, no need to even change the makefile. Thanks – Alexandros May 15 '14 at 11:42