How can I disable auto-vectorization with AVX and FMA instructions? I would still prefer the compiler to employ SSE and SSE2 automatically, but not FMA and AVX.

My code that uses AVX checks for its availability, but GCC performs no such check when auto-vectorizing. So if I compile with -mfma and run the code on any CPU prior to Haswell, I get SIGILL. How can I solve this issue?

Violet Giraffe
  • What you want is a CPU dispatcher. Compile your code with `-msse2` into one object file and with `-mfma` into another, then create a dispatcher that asks CPUID which hardware is available and sets your function pointers to point to either the SSE2 or the AVX/FMA versions: https://stackoverflow.com/questions/23676426/disable-avx2-functions-on-non-haswell-processors/23677889#23677889 – Z boson Sep 18 '14 at 10:56
  • @Zboson: Yep, that's what I did. Make it an answer and I will accept. – Violet Giraffe Sep 18 '14 at 11:37
  • Recent gcc lets you use intrinsics without -mfma or -mavx. You can also specify the target per function. – Marc Glisse Sep 19 '14 at 08:09
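
A minimal sketch of the per-function target idea from the comment above, assuming GCC or Clang on x86. The function names (`scale_add_avx`, `scale_add_baseline`, `scale_add`) are hypothetical and not from the thread; `__attribute__((target(...)))` and `__builtin_cpu_supports` are compiler extensions, so this is not portable to other toolchains.

```cpp
#include <cstddef>

// Hypothetical kernel: only this function may be compiled to AVX/FMA
// instructions, even if the file is built without -mavx or -mfma.
__attribute__((target("avx,fma")))
void scale_add_avx(float* out, const float* a, const float* b,
                   float k, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] * k + b[i];
}

// Baseline version: limited to the instruction set given on the command
// line (SSE2 by default on x86-64).
void scale_add_baseline(float* out, const float* a, const float* b,
                        float k, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] * k + b[i];
}

// Hypothetical wrapper: checks the CPU at run time before taking the
// AVX/FMA path, so older CPUs never execute those instructions.
void scale_add(float* out, const float* a, const float* b,
               float k, std::size_t n) {
    if (__builtin_cpu_supports("avx") && __builtin_cpu_supports("fma"))
        scale_add_avx(out, a, b, k, n);
    else
        scale_add_baseline(out, a, b, k, n);
}
```

Callers only ever use `scale_add`; the rest of the file stays at the baseline instruction set, so the auto-vectorizer cannot emit AVX or FMA outside the attributed function.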

2 Answers

What you want to do is compile a separate object file for each instruction set you are targeting. Then create a CPU dispatcher that asks CPUID for the available instruction set and jumps to the appropriate version of the function.

I already described this in several different questions and answers
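
A rough sketch of such a dispatcher, assuming GCC or Clang on x86. The names `dot_sse2`, `dot_avx_fma`, `select_dot` and `dot` are made up for illustration. In a real build the two implementations would sit in separate source files compiled with `-msse2` and `-mavx -mfma`; the target attribute is used here only to keep the sketch in one file.

```cpp
#include <cstddef>

// Baseline version (stand-in for the object file built with -msse2).
float dot_sse2(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

// AVX/FMA version (stand-in for the object file built with -mavx -mfma).
__attribute__((target("avx,fma")))
float dot_avx_fma(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

using dot_fn = float (*)(const float*, const float*, std::size_t);

// Hypothetical dispatcher: queries the CPU and returns the best version.
dot_fn select_dot() {
    __builtin_cpu_init();  // harmless if the CPU info is already set up
    if (__builtin_cpu_supports("avx") && __builtin_cpu_supports("fma"))
        return dot_avx_fma;
    return dot_sse2;
}

// The CPU is queried once, on first use; later calls go through the pointer.
float dot(const float* a, const float* b, std::size_t n) {
    static const dot_fn impl = select_dot();
    return impl(a, b, n);
}
```

The CPU check happens once, so the per-call cost is one indirect call rather than a feature test on every invocation.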

Z boson

You will need to move the code that uses AVX into a separate compilation unit (in other words, a separate .cpp file), and compile only that file with -mfma or whatever options you want. If you build with -march=native, gcc compiles for "your processor"; if you want generic code, you will need to use -march=x86-64 or -march=core2, or something like that.

Mats Petersson
  • So if I have a template class that uses AVX (all the code is in a header file) I'm in trouble? – Violet Giraffe Sep 18 '13 at 09:32
  • Yes, and it'd have to be pretty well written to avoid paying a penalty for checking if you have AVX every time you call one of those functions. Have you actually benchmarked it? – Mats Petersson Sep 18 '13 at 09:44
  • I have benchmarked it, the speedup compared to SSE2 is small but it's there. Of course I don't call `cpuid` every time, only once on startup. – Violet Giraffe Sep 18 '13 at 09:52
  • Are you sure that there's no separate option for disabling auto-vectorization? – Violet Giraffe Sep 18 '13 at 09:54
  • Yes, of course, the CPUID isn't going to change, but you will have `if (has_avx) do_avx_stuff; else do_sse2_stuff;` or something like that, and if it's going to work on different platforms, it will have to be a runtime decision. In my experience (although this was SSE and 3DNow!/MMX type stuff), it's better to take a larger chunk of code and convert the whole thing. Tiny functions that only do a few steps aren't worth converting like that. – Mats Petersson Sep 18 '13 at 09:54
  • I'm sure that `gcc` will pick "any" instruction it thinks is workable on the architecture you are compiling for, based on what it thinks is best. So it MAY use an AVX instruction if there is a clever one to solve a particular construct. And even if your current version doesn't, there is nothing saying that version `current + 0.0.1` doesn't do that as some sort of improvement. – Mats Petersson Sep 18 '13 at 09:56
  • @VioletGiraffe: One approach could be to explicitly instantiate them in a source file (i.e. a `.cpp` file) that you compile using `-mavx` (which obviously will need to see the full implementation of the template classes). Then, expose the interface of the template class only to other places where it's used. This will prevent any inlining of AVX-using instructions into places where it might not be usable. This, however, relies upon you using a reasonable, bounded number of combinations of template parameters (so that you can just enumerate all of them). – Jason R Sep 18 '13 at 15:16
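
The explicit-instantiation approach from the comment above could look roughly like this. The class `VecOps` and the file names in the comments are hypothetical, and both "files" are shown in one listing only so the sketch stands alone; in practice the second part would be the only source file built with `-mavx`.

```cpp
#include <cstddef>

// ---- vec_ops.h (hypothetical header, included everywhere) ----
// Declarations only: callers never see the bodies, so nothing compiled
// with AVX can be inlined into code built for the baseline target.
template <typename T>
struct VecOps {
    static T sum(const T* data, std::size_t n);
};

// Tell other translation units not to instantiate these themselves.
extern template struct VecOps<float>;
extern template struct VecOps<double>;

// ---- vec_ops_avx.cpp (hypothetical, the only file built with -mavx) ----
template <typename T>
T VecOps<T>::sum(const T* data, std::size_t n) {
    T total = T(0);
    for (std::size_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}

// Enumerate every combination of template parameters that is needed.
template struct VecOps<float>;
template struct VecOps<double>;
```

This only keeps the AVX code in one object file; the caller still needs a runtime check before calling into it on a CPU that may lack AVX.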
  • @MatsPetersson: An alternative to putting `if (has_avx)` in your code is to use function pointers. You write a setup routine that decides, for every function you have optimized versions of, which asm version to use, and then code that wants to use them just calls through the function pointer for that function. (x264 makes extensive use of this to pick optimal versions of things depending on what CPU it's running on. There are a few cases where using a new instruction isn't a win, even on a CPU that supports it, e.g. initial core2 (conroe aka merom) had slow `pshufb`.) – Peter Cordes Jul 10 '15 at 04:53