cpu dispatcher for visual studio for AVX and SSE

Question

I work with two computers: one without AVX support and one with it. It would be convenient to have my code detect the instruction set supported by the CPU at run time and choose the appropriate code path. I've followed Agner Fog's suggestions for making a CPU dispatcher (http://www.agner.org/optimize/#vectorclass). However, on my machine without AVX, compiling and linking the code in Visual Studio with AVX enabled causes it to crash when I run it.

To be specific, I have two source files: one compiled for the SSE2 instruction set and containing SSE2 instructions, and another compiled for AVX and containing AVX instructions. Even if my main function references only the SSE2 functions, the code still crashes merely because the project contains a source file compiled with AVX enabled. Any clues as to how I can fix this?

Edit: Okay, I think I isolated the problem. I'm using Agner Fog's vector class and I have defined three source files as:

//file sse2.cpp - compiled with /arch:SSE2
#include "vectorclass.h"
float func_sse2(const float* a) {
    Vec8f v1 = Vec8f().load(a);
    float sum = horizontal_add(v1);
    return sum;
}
//file avx.cpp - compiled with /arch:AVX
#include "vectorclass.h"
float func_avx(const float* a) {
    Vec8f v1 = Vec8f().load(a);
    float sum = horizontal_add(v1);
    return sum;
}
//file foo.cpp - compiled with /arch:SSE2
#include <stdio.h>
extern float func_sse2(const float* a);
extern float func_avx(const float* a);
int main() {
    float (*fp)(const float*a); 
    float a[] = {1,2,3,4,5,6,7,8};
    int iset = 6;
    if(iset>=7) { 
        fp = func_avx;  
    }
    else { 
        fp = func_sse2;
    }
    float sum = (*fp)(a);
    printf("sum %f\n", sum);
}

This crashes. If I use Vec4f in func_sse2 instead, it does not crash. I don't understand this. I can use Vec8f with SSE2 by itself as long as I don't have another source file compiled with AVX. Agner Fog's manual says

"There is no advantage in using the 256-bit floating point vector classes (Vec8f, Vec4d) unless the AVX instruction set is specified, but it can be convenient to use these classes anyway if the same source code is used with and without AVX. Each 256-bit vector will simply be split up into two 128-bit vectors when compiling without AVX."

However, when I have two source files that use Vec8f, one compiled with SSE2 and one compiled with AVX, I get a crash.
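To illustrate what the manual's "split up into two 128-bit vectors" could look like, here is a minimal sketch. It is not the vector class library's actual implementation: the name Float8Sketch and its members are hypothetical, and on non-x86 targets the sketch falls back to a plain array so that it still compiles.

```cpp
#include <cassert>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE/SSE2 intrinsics

// Hypothetical stand-in for an 8-float vector built from two 128-bit halves.
struct Float8Sketch {
    __m128 lo;  // elements 0..3
    __m128 hi;  // elements 4..7

    // Unaligned loads, so the input array needs no special alignment.
    static Float8Sketch load(const float* p) {
        Float8Sketch v;
        v.lo = _mm_loadu_ps(p);
        v.hi = _mm_loadu_ps(p + 4);
        return v;
    }

    // Sum of all eight elements using only 128-bit operations.
    float horizontal_add() const {
        __m128 s = _mm_add_ps(lo, hi);          // 4 pairwise sums
        __m128 t = _mm_movehl_ps(s, s);         // move upper two sums down
        s = _mm_add_ps(s, t);                   // lanes 0 and 1 hold partial sums
        t = _mm_shuffle_ps(s, s, 1);            // bring lane 1 into lane 0
        s = _mm_add_ss(s, t);                   // final sum in lane 0
        return _mm_cvtss_f32(s);
    }
};
#else
// Scalar fallback with the same interface for targets without SSE2.
struct Float8Sketch {
    float lanes[8];

    static Float8Sketch load(const float* p) {
        Float8Sketch v;
        for (int i = 0; i < 8; ++i) v.lanes[i] = p[i];
        return v;
    }

    float horizontal_add() const {
        float s = 0.0f;
        for (int i = 0; i < 8; ++i) s += lanes[i];
        return s;
    }
};
#endif
```

Calling Float8Sketch::load(a).horizontal_add() on the array from foo.cpp adds the eight elements without any 256-bit instruction. The point is that such a type has the same name and interface whether or not AVX is enabled, while its compiled code differs.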

Edit2: I can get it to work from the command line

>cl -c sse2.cpp
>cl -c /arch:AVX avx.cpp
>cl foo.cpp sse2.obj avx.obj
>foo.exe

Edit3: This, however, crashes

>cl -c sse2.cpp
>cl -c /arch:AVX avx.cpp
>cl foo.cpp avx.obj sse2.obj
>foo.exe

Another clue. Apparently, the order of linking matters. It crashes if avx.obj is before sse2.obj but if sse2.obj is before avx.obj it does not crash. I'm not sure if it chooses the correct code path (I don't have access to my AVX system right now) but at least it does not crash.

What are the details of the crash? Have you identified the failing instruction in a debugger? — Stephen Canon, Mar 14 '13 at 17:58
Well the debugger shows that func_SSE is trying to use the AVX instructions. I don't know why. But I managed to get the code to work without crashing using the command line. I added the commands above. I still don't know how to do it with the IDE. On the plus side I compiled from the command line for the first time in Windows! It's the only way I compile on Linux. — , Mar 14 '13 at 19:36
No, I'm not using link-time code-generation. But I tried it and it made no difference. — , Mar 15 '13 at 07:51
I found another clue. It crashes if avx.obj is before sse2.obj but it works otherwise. — , Mar 15 '13 at 07:53
Since you did not provide full code it's hard to guess the cause but I think the crash might be also related to alignment. — Igor Levicki, Jul 03 '13 at 13:55
What's the conclusion? Does Agner's library do run-time dispatch? — Royi, Feb 18 '18 at 14:10

score 9 · Answer 1 · edited Apr 04 '17 at 13:13

I realise that this is an old question and that the person who asked it appears to be no longer around, but I hit the same problem yesterday. Here's what I worked out.

When compiled, both your sse2.cpp and avx.cpp files produce object files that contain not only your function but also any inline or template functions it requires (e.g. Vec8f::load). These functions are also compiled using the requested instruction set.

This means that your sse2.obj and avx.obj object files will both contain definitions of Vec8f::load, each compiled using its respective instruction set.

However, since the compiler treats Vec8f::load as externally visible, it puts it in a 'COMDAT' section of the object file with a 'selectany' (aka 'pick any') label. This tells the linker that if it sees multiple definitions of this symbol, for example in 2 different object files, then it is allowed to pick any one it likes. (It does this to reduce duplicate code in the final executable, which would otherwise be inflated in size by multiple definitions of template and inline functions.)

The problem you are having is directly related to this in that the order of the object files passed to the linker is affecting which one it picks. Specifically here, it appears to be picking the first definition it sees.

If avx.obj comes first, then the AVX-compiled version of Vec8f::load will always be used. This will crash on a machine that doesn't support that instruction set. On the other hand, if sse2.obj comes first, then the SSE2-compiled version will always be used. This won't crash, but it will only use SSE2 instructions even if AVX is supported.
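For reference, here is a minimal sketch of the run-time selection step itself, with a real AVX check in place of the hard-coded iset value from the question. The names cpu_has_avx, select_impl, dispatched_sum and the two _stub functions are hypothetical. The stubs are plain scalar code standing in for separately compiled SSE2 and AVX translation units. Note that this does nothing about the COMDAT issue described above: the dispatcher can pick the right entry point, yet shared inline helpers may still come from the wrong object file.

```cpp
#include <cassert>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>     // __cpuid
#include <immintrin.h>  // _xgetbv
#endif

// Returns true if the CPU reports AVX and the OS saves the YMM registers.
bool cpu_has_avx() {
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int regs[4];
    __cpuid(regs, 1);                                  // feature flags in ECX
    const bool osxsave = (regs[2] & (1 << 27)) != 0;   // OS uses XSAVE/XGETBV
    const bool avx     = (regs[2] & (1 << 28)) != 0;   // CPU supports AVX
    if (!osxsave || !avx) return false;
    return (_xgetbv(0) & 0x6) == 0x6;                  // XMM and YMM state enabled
#elif (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx") != 0;         // GCC/Clang builtin
#else
    return false;                                      // unknown target: assume no AVX
#endif
}

typedef float (*SumFn)(const float*);

// Scalar stand-ins for the functions that would live in sse2.cpp and avx.cpp.
float func_sse2_stub(const float* a) {
    float s = 0.0f;
    for (int i = 0; i < 8; ++i) s += a[i];
    return s;
}

float func_avx_stub(const float* a) {
    float s = 0.0f;
    for (int i = 0; i < 8; ++i) s += a[i];
    return s;
}

// Chooses the code path from the detected capability.
SumFn select_impl(bool has_avx) {
    return has_avx ? func_avx_stub : func_sse2_stub;
}

// Detects once, then always calls through the chosen pointer.
float dispatched_sum(const float* a) {
    static const SumFn fp = select_impl(cpu_has_avx());
    return fp(a);
}
```

Keeping the detection separate from select_impl makes the choice easy to test on a machine without AVX, since the AVX branch can be requested explicitly without executing any AVX instruction.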

This can be seen in the linker 'map' file output (produced using the /map option). Here are the relevant (edited) excerpts:

//
// link with sse2.obj before avx.obj
//
0001:00000080  _main                             foo.obj
0001:00000330  ?func_sse2@@YAMPBM@Z              sse2.obj
0001:00000420  ??0Vec256fe@@QAE@XZ               sse2.obj
0001:00000440  ??0Vec4f@@QAE@ABT__m128@@@Z       sse2.obj
0001:00000470  ??0Vec8f@@QAE@XZ                  sse2.obj <-- sse2 version used
0001:00000490  ??BVec4f@@QBE?AT__m128@@XZ        sse2.obj
0001:000004c0  ?get_high@Vec8f@@QBE?AVVec4f@@XZ  sse2.obj
0001:000004f0  ?get_low@Vec8f@@QBE?AVVec4f@@XZ   sse2.obj
0001:00000520  ?load@Vec8f@@QAEAAV1@PBM@Z        sse2.obj <-- sse2 version used
0001:00000680  ?func_avx@@YAMPBM@Z               avx.obj
0001:00000740  ??BVec8f@@QBE?AT__m256@@XZ        avx.obj

//
// link with avx.obj before sse2.obj
//
0001:00000080  _main                             foo.obj
0001:00000270  ?func_avx@@YAMPBM@Z               avx.obj
0001:00000330  ??0Vec8f@@QAE@XZ                  avx.obj <-- avx version used
0001:00000350  ??BVec8f@@QBE?AT__m256@@XZ        avx.obj
0001:00000380  ?load@Vec8f@@QAEAAV1@PBM@Z        avx.obj <-- avx version used
0001:00000580  ?func_sse2@@YAMPBM@Z              sse2.obj
0001:00000670  ??0Vec256fe@@QAE@XZ               sse2.obj
0001:00000690  ??0Vec4f@@QAE@ABT__m128@@@Z       sse2.obj
0001:000006c0  ??BVec4f@@QBE?AT__m128@@XZ        sse2.obj
0001:000006f0  ?get_high@Vec8f@@QBE?AVVec4f@@XZ  sse2.obj
0001:00000720  ?get_low@Vec8f@@QBE?AVVec4f@@XZ   sse2.obj

As for fixing it, that's another matter. In this case, the following blunt hack should work by forcing the avx version to have its own differently named versions of the template functions. This will increase the resulting executable size as it will contain multiple versions of the same function even if the sse2 and avx versions are identical.

// avx.cpp
namespace AVXWrapper {
#include "vectorclass.h"
}
using namespace AVXWrapper;

float func_avx(const float* a)
{
    ...
}

There are some important limitations, though: (a) if the included file manages any form of global state, it will no longer be truly global, as you will have 2 'semi-global' versions, and (b) you won't be able to pass vectorclass variables as parameters between other code and functions defined in avx.cpp.
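The effect of the wrapper, and limitation (b), can be shown in a single file. This is only a sketch: the macro SKETCH_HEADER_BODY is a hypothetical stand-in for the contents of a header such as vectorclass.h, Vec8 is a simplified scalar type, and SSE2Wrapper is an extra namespace added purely for symmetry.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for what a header would define when included.
#define SKETCH_HEADER_BODY                                   \
    struct Vec8 {                                            \
        float lanes[8];                                      \
        Vec8& load(const float* p) {                         \
            for (int i = 0; i < 8; ++i) lanes[i] = p[i];     \
            return *this;                                    \
        }                                                    \
    };                                                       \
    inline float horizontal_add(const Vec8& v) {             \
        float s = 0.0f;                                      \
        for (int i = 0; i < 8; ++i) s += v.lanes[i];         \
        return s;                                            \
    }

// "Including" the same header inside two namespaces yields two unrelated
// sets of symbols, so the linker has nothing to merge between them.
namespace SSE2Wrapper { SKETCH_HEADER_BODY }
namespace AVXWrapper  { SKETCH_HEADER_BODY }

float func_sse2(const float* a) {
    SSE2Wrapper::Vec8 v;
    v.load(a);
    return SSE2Wrapper::horizontal_add(v);
}

float func_avx(const float* a) {
    AVXWrapper::Vec8 v;
    v.load(a);
    return AVXWrapper::horizontal_add(v);
}

// Limitation (b): the two Vec8 types are distinct, so a value of one
// cannot be passed where the other is expected.
static_assert(!std::is_same<SSE2Wrapper::Vec8, AVXWrapper::Vec8>::value,
              "wrapped types are unrelated");
```

Because the two Vec8::load functions now have different mangled names, neither can be substituted for the other at link time. That is also why only plain types such as const float* can cross the boundary between the two code paths.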

If functions are identical the linker is able to 'fold' them together and remove the redundant one. — Bas, Oct 08 '22 at 10:31

score 2 · Answer 2 · answered Mar 15 '13 at 11:58

The fact that the link order matters makes me think that there might be some kind of initialization code in the obj file. If the initialization code is communal, then only the first one is taken. I can't reproduce it, but you should be able to see it in an assembly listing (compile with /c /Fatestavx.asm).

A Fog

The dispatcher does not crash on my system with AVX but it crashes on my system without. Can you test it on a system without AVX? Maybe the system on AVX is not choosing the SSE instructions either but since it has AVX it still works? The assembly listing is a bit too advanced for me right now so it's likely I will have to come back to this. – Mar 18 '13 at 10:45

score 1 · Answer 3 · answered Mar 14 '13 at 10:55

Put the SSE and AVX functions in different CPP files and be sure to compile the SSE version without /arch:AVX.

Marat Dukhan

That's exactly what I did. – Mar 14 '13 at 11:58
Then just run it under a debugger. When the CPU generates an "invalid instruction" exception, you will see where the instruction originates. It is likely that your non-AVX CPU doesn't support some of the SSE instructions you use. There are many generations of SSE instructions: SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, and SSE4A (includes SSE3, but not SSSE3, SSE4.1 or SSE4.2). – Marat Dukhan Mar 14 '13 at 12:10
My CPU supports up to SSE4.2. I checked it with CPU-Z. But I'm now trying a stripped down version of the code without the vectorclass and it's working. I'll have to get back to you... – Mar 14 '13 at 12:30
I added some text to my question that may help explain things. – Mar 14 '13 at 13:17
