Helping GCC with auto-vectorisation

Question

I have a shader I need to optimise (with lots of vector operations) and I am experimenting with SSE instructions in order to better understand the problem.

I have some very simple sample code. With the USE_SSE define it uses explicit SSE intrinsics; without it I'm hoping GCC will do the work for me. Auto-vectorisation feels a bit finicky but I'm hoping it will save me some hair.

Compiler and platform: gcc 4.7.1 (tdm64), target x86_64-w64-mingw32, on Windows 7 with an Ivy Bridge CPU.

Here's the test code:

```
/*
    Include all the SIMD intrinsics.
*/
#ifdef USE_SSE
#include <x86intrin.h>
#endif
#include <cstdio>

#if   defined(__GNUG__) || defined(__clang__)
    /* GCC & CLANG */

    #define SSVEC_FINLINE __attribute__((always_inline))

#elif defined(_WIN32) && defined(_MSC_VER)
    /* MSVC. */

    #define SSVEC_FINLINE __forceinline

#else
#error Unsupported platform.
#endif


#ifdef USE_SSE

    typedef __m128 vec4f;

    inline void addvec4f(vec4f &a, vec4f const &b)
    {
        a = _mm_add_ps(a, b);
    }

#else

    typedef float vec4f[4];

    inline void addvec4f(vec4f &a, vec4f const &b)
    {
        a[0] = a[0] + b[0];
        a[1] = a[1] + b[1];
        a[2] = a[2] + b[2];
        a[3] = a[3] + b[3];
    }

#endif

int main(int argc, char *argv[])
{
    int const count = 1e7;

    #ifdef USE_SSE
    printf("Using SSE.\n");
    #else
    printf("Not using SSE.\n");
    #endif

    vec4f data = {1.0f, 1.0f, 1.0f, 1.0f};

    for (int i = 0; i < count; ++i)
    {
        vec4f val = {0.1f, 0.1f, 0.1f, 0.1f};
        addvec4f(data, val);
    }

    float result[4] = {0};
    #ifdef USE_SSE
    _mm_store_ps(result, data);
    #else
    result[0] = data[0];
    result[1] = data[1];
    result[2] = data[2];
    result[3] = data[3];
    #endif

    printf("Result: %f %f %f %f\n", result[0], result[1], result[2], result[3]);

    return 0;
}
```

This is compiled with:

```
g++ -O3 ssetest.cpp -o nossetest.exe
g++ -O3 -DUSE_SSE ssetest.cpp -o ssetest.exe
```

Apart from the explicit SSE-version being a bit quicker there is no difference in output.

Here's the assembly for the loop, first explicit SSE:

```
.L3:
subl    $1, %eax
addps   %xmm1, %xmm0
jne .L3
```

It inlined the call. Nice, more or less just a straight up _mm_add_ps.

Array version:

```
.L3:
subl    $1, %eax
addss   %xmm0, %xmm1
addss   %xmm0, %xmm2
addss   %xmm0, %xmm3
addss   %xmm0, %xmm4
jne .L3
```

It is using SSE math alright, but on each array member. Not really desirable.

My question is, how can I help GCC so that it can better optimise the array version of vec4f?

Any Linux-specific tips are helpful too; that's where the real code will run.

Be aware that `float result[4]` may not be 16-byte aligned on the stack - it happens to work in this instance, or `_mm_store_ps` would fault. — Brett Hale, Mar 18 '13 at 00:44
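One way to act on the alignment point in the comment above is sketched here. It assumes a C++11 compiler that supports `alignas` (older GCC releases such as 4.7 may need `__attribute__((aligned(16)))` instead). The names `is_aligned16`, `AlignedVec4` and `require_aligned16` are made up for this illustration and do not appear in the question's code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if p sits on a 16-byte boundary.
inline bool is_aligned16(void const *p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical wrapper type. alignas(16) guarantees the alignment that
// aligned SSE loads and stores such as _mm_store_ps require.
struct alignas(16) AlignedVec4
{
    float v[4];
};

// Hypothetical debug-time guard: asserts the pointer is 16-byte aligned
// and hands it back, so it can wrap the argument of an aligned store.
inline float *require_aligned16(float *p)
{
    assert(is_aligned16(p) && "pointer must be 16-byte aligned");
    return p;
}
```

With this in place, the store in the question could be written as `_mm_store_ps(require_aligned16(result.v), data);` with `result` declared as `AlignedVec4`. If the alignment cannot be guaranteed, `_mm_storeu_ps` is the unaligned counterpart of `_mm_store_ps`.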

score 7 · Answer 1 · answered Mar 17 '13 at 23:06


This LockLess article, Auto-vectorization with gcc 4.7, is hands down the best article I have seen on the topic, and I have spent a while looking for good ones. They also have many other articles on related low-level software development subjects that you may find useful.

Shafik Yaghmour

Thanks for the link, trying the tips in that article but it seems really hard to get it auto-vectorised properly. Trying a struct right now, forced to be 16-byte aligned but the best I can do is 4 `addss` instructions. – Skurmedel Mar 18 '13 at 01:18

@Skurmedel Agreed. I found this and other articles while doing research, back when I was more actively expanding the SSE section of the X86 wikibook. Have you looked at intrinsics? I have been looking for some of the intrinsics material I had, but I cannot seem to dig it up anymore. – Shafik Yaghmour Mar 19 '13 at 02:17

score 5 · Answer 2 · edited May 23 '17 at 12:32

Here are some tips, based on your code, for getting GCC auto-vectorisation to work:

Make the loop upper bound a constant. To vectorise, GCC needs to split the loop into groups of 4 iterations so that each group fits in an SSE XMM register, which is 128 bits wide. A constant upper bound lets GCC confirm that the loop has enough iterations for vectorisation to be profitable.
Remove the inline keyword. If the function is marked inline, GCC cannot tell whether the start of the array is aligned without interprocedural analysis, which -O3 does not turn on.

So, to get your code vectorised, your addvec4f function should be modified as follows:
```
void addvec4f(vec4f &a, vec4f const &b)
{
    for (int i = 0; i < 4; ++i)
        a[i] = a[i] + b[i];
}
```

BTW:

GCC also has flags that tell you whether a loop has been vectorised, for example -ftree-vectorizer-verbose=2. A higher number gives more output; currently the value can be 0, 1 or 2. Here is the documentation for this flag and some related flags.
Be careful with alignment. The address of the array should be aligned, and the compiler cannot know whether it is aligned without running the code. Usually, an aligned SSE instruction applied to misaligned data faults (reported as a bus error or a segmentation fault, depending on the platform). Here is the reason.
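The alignment tip can also be expressed in the code itself. The following is a minimal sketch, not a guaranteed recipe: it assumes GCC 4.7 or later (or Clang), where `__builtin_assume_aligned` and the `__restrict` extension are available. The function name `addvec4f_aligned` and its raw-pointer signature are hypothetical and differ from the `vec4f` references used in the question.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical raw-pointer variant of addvec4f.
// __restrict promises that a and b do not overlap, and
// __builtin_assume_aligned promises 16-byte alignment. Both are promises
// made by the caller; breaking them is undefined behaviour.
void addvec4f_aligned(float *__restrict a, float const *__restrict b)
{
    // Debug-build checks that the caller kept the alignment promise.
    assert(reinterpret_cast<std::uintptr_t>(a) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b) % 16 == 0);

    float *pa = static_cast<float *>(__builtin_assume_aligned(a, 16));
    float const *pb =
        static_cast<float const *>(__builtin_assume_aligned(b, 16));

    // Constant trip count of 4: one 128-bit register's worth of floats.
    for (int i = 0; i < 4; ++i)
        pa[i] += pb[i];
}
```

Callers would need to declare their arrays with 16-byte alignment (for example `alignas(16) float data[4];` in C++11). Whether GCC then emits a single `addps` still depends on the compiler version and flags, so it is worth checking the vectoriser report or the generated assembly.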

Thanks, gonna try your suggestions. – Skurmedel Mar 19 '13 at 18:51
