
Dupe of Have different optimizations (plain, SSE, AVX) in the same executable with C/C++

The "Auto-duplicate" thing picked the wrong suggested duplicate, and I don't seem to have the interface to fix it.


Is there any way to build an application that will optionally use instruction set extensions if available, and yet still function (albeit more slowly) in their absence?

Right now, I have an MSVC++ radar imaging application that does a lot of math that can benefit from vectorization. If I compile the application with AVX enabled, it crashes at startup when run on a CPU that lacks those instructions.

Is there any way to have the compiler generate both AVX-accelerated and plain instructions, and switch between them at runtime? It seems like the only other option is to have two complete builds of the application, and choose which to install based on the CPU architecture, but that sounds like more trouble than it's worth.

Fake Name
  • A set of dynamic load libraries could be appropriate. After detecting the cpu features, a [tag:dll] would be loaded ([LoadLibrary](https://msdn.microsoft.com/en-us/library/windows/desktop/ms684175(v=vs.85).aspx) or [LoadLibraryEx](https://msdn.microsoft.com/en-us/library/windows/desktop/ms684179(v=vs.85).aspx)) – J.J. Hakala Jul 11 '16 at 23:35

1 Answer


At least with the compilers I'm aware of, no: they don't generate code that automatically detects and takes advantage of the available instruction set (at least not from completely portable code--see below for more).

Rather than two complete builds of the application, it often makes sense to move the numerical, CPU-intensive parts of the application into a DLL, and choose the right version of those at run-time. This lets the UI and things like that (which don't really benefit from special instructions) live in a common area, and only the few specific pieces you care about get switched out at run-time.

Since you tagged C++, I'll mention one possible way of managing this: define an abstract base class that defines the interface to your numerical routines. Then define a derived class for each instruction set you care about.

Then at run-time, you have a pointer to the base. Your initialization code checks the available CPU, and initializes that pointer to point to an instance of the correct type of derived object. From that point onward, you just interact with the routines via that pointer.
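A minimal sketch of that pattern (the class and function names are illustrative, not from any particular library, and the CPU check is a stub standing in for a real CPUID query):

```cpp
#include <cstddef>
#include <memory>

// Abstract interface to the numerical routines. Each virtual call
// should do a large batch of work so the indirection is amortized.
struct Kernels {
    virtual ~Kernels() = default;
    virtual double dot(const double* a, const double* b, std::size_t n) const = 0;
};

// Plain, portable implementation.
struct ScalarKernels : Kernels {
    double dot(const double* a, const double* b, std::size_t n) const override {
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i) s += a[i] * b[i];
        return s;
    }
};

// An AVX version would live in its own translation unit (or DLL)
// compiled with /arch:AVX; the intrinsics are omitted here.
struct AvxKernels : ScalarKernels { /* override dot() with AVX intrinsics */ };

// Stub: real code would query CPUID (e.g. __cpuid on MSVC,
// __builtin_cpu_supports("avx") on GCC/Clang).
bool cpu_has_avx() { return false; }

// Picked once at startup; all later calls go through the returned pointer.
std::unique_ptr<Kernels> make_kernels() {
    if (cpu_has_avx()) return std::make_unique<AvxKernels>();
    return std::make_unique<ScalarKernels>();
}
```

The rest of the program only ever sees the `Kernels` interface, so adding (say) an AVX-512 variant later is just one more derived class and one more branch in the factory.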

To avoid adding excessive overhead, you generally want to define the functions in that class (or those classes, as the case may be) so each does quite a bit of work, to amortize the added cost of a virtual function call over a larger amount of work.

You don't normally have to get involved with things like directly calling LoadLibrary, though. You can use the /DELAYLOAD linker flag to tell the linker to load a DLL only when you actually call a function in it. Then only the DLL corresponding to the derived class(es) you instantiate will be loaded.
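The run-time check itself can be done with the CPUID instruction. A hedged sketch (the function name is illustrative; AVX support is reported by bit 28 of ECX from CPUID leaf 1, and a fully robust check would also verify OSXSAVE and use XGETBV, since the OS must support saving the wider registers):

```cpp
#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#  if defined(_MSC_VER)
#    include <intrin.h>   // __cpuid
#  else
#    include <cpuid.h>    // __get_cpuid
#  endif

// True when the CPU reports AVX (CPUID leaf 1, ECX bit 28).
// Note: this checks the CPU bit only, not OS support (OSXSAVE/XGETBV).
bool cpu_has_avx() {
#  if defined(_MSC_VER)
    int regs[4];
    __cpuid(regs, 1);
    return (regs[2] & (1 << 28)) != 0;
#  else
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return false;
    return (ecx & (1u << 28)) != 0;
#  endif
}
#else
// Non-x86 targets have no AVX to detect.
bool cpu_has_avx() { return false; }
#endif
```

You'd call this once at startup and cache the result, then use it to decide which implementation (or DLL) to wire up.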

Another obvious possibility would be to use a library/compiler extension like Cilk Plus to manage most of the vectorizing for you. With this, you can (for example) write things like:

a[:] = b[:] + c[:];

...to get something roughly equivalent to:

for (int i=0; i<size; i++)
    a[i] = b[i] + c[i];

The compiler links to the Cilk Plus library, which handles checking what instruction sets the CPU supports, and uses them as appropriate.

Jerry Coffin
  • I can think of ways to hack around the issue, but I was *really* hoping there'd be some compiler-related way to generate multiple codepaths, and have the compiler hide everything. It seems like it should be possible to have a compiler pragma or similar that can generate multiple versions of a function for different CPUs, and choose the right one at runtime. – Fake Name Jul 12 '16 at 00:51
  • How could one use Cilk in Visual Studio? – Royi Feb 18 '18 at 09:03