What's the proper way to use different versions of SSE intrinsics in GCC?

Question

I will ask my question by way of an example. I have a function called do_something().

It has three versions: do_something(), do_something_sse3(), and do_something_sse4(). When my program runs, it detects the CPU features (whether SSE3 or SSE4 is supported) and calls one of the three versions accordingly.

The problem is this: when I build my program with GCC, I have to set -msse4 for do_something_sse4() to compile (e.g. for the header file <smmintrin.h> to be included).

However, if I set -msse4, GCC is allowed to use SSE4 instructions everywhere, and some intrinsics in do_something_sse3() are also translated to SSE4 instructions. So if my program runs on a CPU that supports only SSE3 (but not SSE4), calling do_something_sse3() causes an "illegal instruction" error.

Maybe this is bad practice on my part. Could you give some suggestions? Thanks.

I think the standard approach is to compile the different versions in separate compilation units. — Mysticial, Mar 23 '13 at 08:59
@Mysticial, first, thank you for editing my question. As I understand it, "compile the different versions in separate compilation units" means: put all the `do_things_sse4` functions in a file `functions_sse4.c` and compile it with the option `-msse4`, and compile `functions_sse3.c` with `-msse3`. I will try this. (I may need to restructure my code, which was originally written for MSVC.) – shengbinmeng, Mar 23 '13 at 09:18
We can also ask why you need all these different versions. If the program has an acceptable speed using SSE3, why do you need the SSE4 version? Newer CPUs will likely be faster anyway. — Bo Persson, Mar 23 '13 at 12:04
@BoPersson, some functions can be sped up further by using new SSE4 instructions. As we are dealing with video encoding/decoding, which can be very time-consuming, I think the SSE4 optimization is worthwhile. – shengbinmeng, Mar 23 '13 at 16:58
@BoPersson: Admittedly, SSE4 is kind of a bozo ISA extension for most workloads (though `round[ss/sd/ps/pd]` is occasionally wonderful, and `ptest` and `blendps` definitely have their uses). However, **S**SSE3 (mostly `pshufb` and `pmulhrsw`) and AVX can make an enormous difference if used properly. — Stephen Canon, Mar 24 '13 at 11:56
I just questioned if we need *all* these variants. If SSE4 makes a big difference, who is ever going to use the non-SSE version? — Bo Persson, Mar 24 '13 at 13:54
@BoPersson: There are still many computers without SSE4/SSE3 support, or even without any SSE support. The non-SSE version is for them. — shengbinmeng, Mar 25 '13 at 01:40
@edwin - Yes, but if SSE4 is much faster than SSE3, would not the non-SSE version be terribly slow? Who would then want to use that? — Bo Persson, Mar 25 '13 at 07:34
@BoPersson: Generally, non-SSE version would be 3~4 times slower. Yeah, no one *wants to* use that, but sometimes they *have to* (e.g. with a machine not supporting SSE at all). — shengbinmeng, Mar 25 '13 at 09:11

score 11 · Accepted Answer · answered Mar 23 '13 at 09:16


I think Mysticial's tip is fine, but if you really want to do it in one file, you can use the appropriate pragmas, for instance:

#pragma GCC target("sse4.1")

GCC 4.4 is needed, AFAIR.
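As a rough sketch of the single-file approach, the target can also be set per function with GCC's `target` attribute instead of a file-wide pragma. This assumes GCC 4.9 or newer on x86 and a build without -msse4; older releases refuse to include the intrinsics headers unless the matching -m flag is given. The names max4, max4_scalar and max4_sse41 are made up for illustration; `__builtin_cpu_supports` is GCC's built-in runtime feature check.

```cpp
#include <immintrin.h>  // x86 intrinsics; no -msse4 flag assumed (GCC 4.9+)

// Portable fallback: compiled with the file's baseline flags only.
static int max4_scalar(const int* v) {
    int m = v[0];
    for (int i = 1; i < 4; ++i)
        if (v[i] > m) m = v[i];
    return m;
}

// Only this function is allowed to contain SSE4.1 instructions.
__attribute__((target("sse4.1")))
static int max4_sse41(const int* v) {
    __m128i x = _mm_loadu_si128(reinterpret_cast<const __m128i*>(v));
    __m128i y = _mm_shuffle_epi32(x, _MM_SHUFFLE(1, 0, 3, 2));  // swap 64-bit halves
    x = _mm_max_epi32(x, y);                                    // pmaxsd (SSE4.1)
    y = _mm_shuffle_epi32(x, _MM_SHUFFLE(2, 3, 0, 1));          // swap neighbours
    x = _mm_max_epi32(x, y);
    return _mm_cvtsi128_si32(x);  // every lane now holds the maximum
}

// Runtime dispatch: take the SSE4.1 path only when the CPU reports support.
static int max4(const int* v) {
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse4.1") ? max4_sse41(v) : max4_scalar(v);
}
```

Because the rest of the file is still compiled for the baseline target, the compiler has no licence to emit SSE4 instructions in max4_scalar or max4, which is the failure described in the question.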


konrad.kruczynski


thank you for this suggestion. I will also try the `#pragma` directive later. – shengbinmeng Mar 23 '13 at 16:53
Can't include smmintrin.h even with #pragma GCC target("sse4") – Trass3r Jul 12 '13 at 11:29

score 2 · Answer 2 · edited May 23 '17 at 11:54

I think you want to build what's called a "CPU dispatcher". I got one working (as far as I know) for GCC, but have not gotten it to work with Visual Studio; see "cpu dispatcher for visual studio for AVX and SSE".

I would check out Agner Fog's vectorclass and the file dispatch_example.cpp http://www.agner.org/optimize/#vectorclass

g++ -O3 -msse2   -c dispatch_example.cpp -o d2.o
g++ -O3 -msse4.1 -c dispatch_example.cpp -o d5.o
g++ -O3 -mavx    -c dispatch_example.cpp -o d8.o
g++ -O3 -msse2   instrset_detect.cpp d2.o d5.o d8.o
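The dispatching step itself can be sketched with a function pointer that starts out pointing at the dispatcher and is overwritten on the first call. This is only an illustration of the idea, not the code in dispatch_example.cpp: the names sum, sum_generic, sum_avx, sum_dispatch and sum_ptr are hypothetical, and to keep the sketch in one file the AVX version uses GCC's `target` attribute (GCC 4.9+ assumed) where the build above would compile it in its own object file with -mavx.

```cpp
#include <immintrin.h>
#include <cstddef>

typedef float (*sum_fn)(const float*, std::size_t);

// Baseline version, standing in for the object built with -msse2.
static float sum_generic(const float* p, std::size_t n) {
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i) total += p[i];
    return total;
}

// AVX version, standing in for the object built with -mavx.
__attribute__((target("avx")))
static float sum_avx(const float* p, std::size_t n) {
    __m256 acc = _mm256_setzero_ps();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i));
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float total = 0.0f;
    for (int k = 0; k < 8; ++k) total += lanes[k];
    for (; i < n; ++i) total += p[i];  // leftover elements
    return total;
}

static float sum_dispatch(const float* p, std::size_t n);

// Starts out pointing at the dispatcher; after the first call it points
// directly at the selected implementation.
static sum_fn sum_ptr = sum_dispatch;

static float sum_dispatch(const float* p, std::size_t n) {
    __builtin_cpu_init();
    sum_ptr = __builtin_cpu_supports("avx") ? sum_avx : sum_generic;
    return sum_ptr(p, n);
}

// Public entry point: one indirect call, no feature test per call.
static float sum(const float* p, std::size_t n) { return sum_ptr(p, n); }
```

Callers only ever use sum(); the feature test runs once. The pointer update is not synchronized here, so a multithreaded program would probably want to resolve it before starting threads.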

score 0 · Answer 3 · answered Mar 25 '13 at 05:04

Here is an example of compiling a separate object file for each optimization setting: http://notabs.org/lfsr/software/index.htm

But even this method fails when GCC link-time optimization (-flto) is used. So how can a single executable be built with full optimization for different processors? The only solution I can find is to use include directives to make the C files behave as a single compilation unit, so that -flto is not needed. Here is an example using that method: http://notabs.org/blcutil/index.htm

score 0 · Answer 4 · edited May 23 '17 at 10:31

If you are using GCC 4.9 or above on an i686 or x86_64 machine, then you are supposed to be able to use intrinsics regardless of your -march=XXX and -mXXX options. You could write your do_something() accordingly:

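void do_something()
{
    byte temp[18];

    // Test the most capable instruction set first: every CPU with SSSE3
    // also has SSE2, so testing SSE2 first would make this branch unreachable.
    if (HasSSSE3())
    {
        // Reverse the bytes within each 32-bit word with a single pshufb
        const __m128i MASK = _mm_set_epi8(12,13,14,15, 8,9,10,11, 4,5,6,7, 0,1,2,3);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(temp),
           _mm_shuffle_epi8(_mm_loadu_si128((const __m128i*)(ptr)), MASK));
    }
    else if (HasSSE2())
    {
        const __m128i i = _mm_loadu_si128((const __m128i*)(ptr));
        ...
    }
    else
    {
        // Do the byte swap/endian reversal manually
        ...
    }
}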

You have to supply HasSSE2(), HasSSSE3() and friends. Also see "Intrinsics for CPUID like informations?".
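One possible way to supply those helpers on GCC or Clang is the `__get_cpuid` wrapper from <cpuid.h>. The sketch below is an assumption about how they might look, not the answerer's code: the CpuFeatures struct and the QueryCpu, Cpu and HasSSE41 names are invented here, and the bit positions are those documented for CPUID leaf 1.

```cpp
#include <cpuid.h>  // GCC/Clang wrapper around the CPUID instruction

// Hypothetical container for the feature bits of interest.
struct CpuFeatures {
    bool sse2;
    bool sse3;
    bool ssse3;
    bool sse41;
};

static CpuFeatures QueryCpu() {
    CpuFeatures f = {false, false, false, false};
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    // Leaf 1 reports the basic feature flags in ECX and EDX.
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        f.sse2  = ((edx >> 26) & 1u) != 0;  // EDX bit 26
        f.sse3  = ((ecx >> 0)  & 1u) != 0;  // ECX bit 0
        f.ssse3 = ((ecx >> 9)  & 1u) != 0;  // ECX bit 9
        f.sse41 = ((ecx >> 19) & 1u) != 0;  // ECX bit 19
    }
    return f;
}

// CPUID is queried once, on first use, and the result is cached.
static const CpuFeatures& Cpu() {
    static const CpuFeatures f = QueryCpu();
    return f;
}

bool HasSSE2()  { return Cpu().sse2; }
bool HasSSE3()  { return Cpu().sse3; }
bool HasSSSE3() { return Cpu().ssse3; }
bool HasSSE41() { return Cpu().sse41; }
```

On GCC 4.8 and later, `__builtin_cpu_supports("ssse3")` and similar calls give the same information without the manual bit tests.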

Also see GCC Issue 57202, "Please make the intrinsics headers like immintrin.h be usable without compiler flags". However, I don't believe the feature works: I regularly encounter compile failures because GCC does not make the intrinsics available.
