
I have a shader I need to optimise (with lots of vector operations) and I am experimenting with SSE instructions in order to better understand the problem.

I have some very simple sample code. With the USE_SSE define it uses explicit SSE intrinsics; without it I'm hoping GCC will do the work for me. Auto-vectorisation feels a bit finicky but I'm hoping it will save me some hair.

Compiler and platform: GCC 4.7.1 (tdm64), target x86_64-w64-mingw32, on Windows 7 with an Ivy Bridge CPU.

Here's the test code:

/*
    Include all the SIMD intrinsics.
*/
#ifdef USE_SSE
#include <x86intrin.h>
#endif
#include <cstdio>

#if   defined(__GNUG__) || defined(__clang__) 
    /* GCC & CLANG */

    #define SSVEC_FINLINE __attribute__((always_inline))

#elif defined(_WIN32) && defined(_MSC_VER)
    /* MSVC. */

    #define SSVEC_FINLINE __forceinline

#else
#error Unsupported platform.
#endif


#ifdef USE_SSE

    typedef __m128 vec4f;

    inline void addvec4f(vec4f &a, vec4f const &b)
    {
        a = _mm_add_ps(a, b);
    }

#else

    typedef float vec4f[4];

    inline void addvec4f(vec4f &a, vec4f const &b)
    {
        a[0] = a[0] + b[0];
        a[1] = a[1] + b[1];
        a[2] = a[2] + b[2];
        a[3] = a[3] + b[3];
    }

#endif

int main(int argc, char *argv[])
{
    int const count = 1e7;

    #ifdef USE_SSE
    printf("Using SSE.\n");
    #else
    printf("Not using SSE.\n");
    #endif

    vec4f data = {1.0f, 1.0f, 1.0f, 1.0f};

    for (int i = 0; i < count; ++i)
    {
        vec4f val = {0.1f, 0.1f, 0.1f, 0.1f};
        addvec4f(data, val);
    }

    float result[4] = {0};
    #ifdef USE_SSE
    _mm_store_ps(result, data);
    #else
    result[0] = data[0];
    result[1] = data[1];
    result[2] = data[2];
    result[3] = data[3];
    #endif

    printf("Result: %f %f %f %f\n", result[0], result[1], result[2], result[3]);

    return 0;
}

This is compiled with:

g++ -O3 ssetest.cpp -o nossetest.exe
g++ -O3 -DUSE_SSE ssetest.cpp -o ssetest.exe

Apart from the explicit SSE version being a bit quicker, there is no difference in output.

Here's the assembly for the loop, first explicit SSE:

.L3:
subl    $1, %eax
addps   %xmm1, %xmm0
jne .L3

It inlined the call. Nice, more or less just a straight up _mm_add_ps.

Array version:

.L3:
subl    $1, %eax
addss   %xmm0, %xmm1
addss   %xmm0, %xmm2
addss   %xmm0, %xmm3
addss   %xmm0, %xmm4
jne .L3

It is using SSE math alright, but on each array member. Not really desirable.

My question is, how can I help GCC so that it can better optimise the array version of vec4f?

Any Linux-specific tips are helpful too; that's where the real code will run.

Skurmedel
  • Be aware that `float result[4]` may not be 16-byte aligned on the stack - it happens to work in this instance, or `_mm_store_ps` would fault. – Brett Hale Mar 18 '13 at 00:44
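The alignment point in the comment above can be sketched as follows. This is a minimal illustration, not code from the question: the helper names `is_aligned16`, `store_aligned`, `store_any` and `demo_lane` are hypothetical, and `alignas` assumes a C++11 compiler (on older GCC, `__attribute__((aligned(16)))` is the usual substitute).

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // SSE: __m128, _mm_set1_ps, _mm_add_ps, _mm_store_ps, _mm_storeu_ps

// Hypothetical helper: true if p sits on a 16-byte boundary.
inline bool is_aligned16(const void *p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Aligned store: the caller must guarantee 16-byte alignment of out.
inline void store_aligned(float *out, __m128 v)
{
    assert(is_aligned16(out));
    _mm_store_ps(out, v);
}

// Unaligned store: valid for any float pointer with room for four floats.
inline void store_any(float *out, __m128 v)
{
    _mm_storeu_ps(out, v);
}

// Hypothetical demo: computes 1.0 + 0.5 in all four lanes, stores the
// vector into an explicitly aligned buffer and returns one lane.
inline float demo_lane(int lane)
{
    alignas(16) float result[4] = {0};
    __m128 v = _mm_add_ps(_mm_set1_ps(1.0f), _mm_set1_ps(0.5f));
    store_aligned(result, v);
    return result[lane];
}
```

Either requesting the alignment explicitly or switching to `_mm_storeu_ps` removes the dependence on where the compiler happens to place the buffer.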

2 Answers

7

This LockLess article on auto-vectorization with GCC 4.7 is hands down the best article I have seen on the subject, and I have spent a while looking for good articles on similar topics. They also have a lot of other articles that you may find useful, dealing with all manner of low-level software development.

Shafik Yaghmour
  • Thanks for the link, trying the tips in that article but it seems really hard to get it auto-vectorised properly. Trying a struct right now, forced to be 16-byte aligned but the best I can do is 4 `addss` instructions. – Skurmedel Mar 18 '13 at 01:18
  • @Skurmedel Agreed, I found this and other articles while doing research when I was more actively expanding the SSE section of the X86 wikibook. Have you looked at intrinsics? I have been looking for some of the intrinsics material I had, but I cannot seem to dig it up anymore. – Shafik Yaghmour Mar 19 '13 at 02:17
5

Here are some tips, based on your code, to make GCC auto-vectorization work:

  • Make the loop upper bound a constant. To vectorize, GCC needs to split the loop into groups of 4 iterations so that each group fits in a 128-bit SSE XMM register. A constant upper bound helps GCC confirm that the loop has enough iterations for vectorization to be profitable.
  • Remove the inline keyword. If the function is marked inline, GCC cannot know whether the start of the array is aligned without interprocedural analysis, which is not turned on by -O3.

    So, to get your code vectorized, modify your addvec4f function as follows:

    void addvec4f(vec4f &a, vec4f const &b)
    {
        for (int i = 0; i < 4; ++i)
            a[i] = a[i] + b[i];
    }
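
For trying the suggestion in isolation, here is a self-contained sketch of the loop form together with a small driver. The `accumulate` function and its parameters are hypothetical and not part of the question's code; whether GCC actually vectorizes the loop still has to be checked in the generated assembly or the vectorizer report.

```cpp
#include <cassert>

typedef float vec4f[4];

// Loop form of the add, written as a counted loop and not marked inline.
void addvec4f(vec4f &a, vec4f const &b)
{
    for (int i = 0; i < 4; ++i)
        a[i] = a[i] + b[i];
}

// Hypothetical driver: starts from a vector of ones, adds `step` to every
// element `count` times and copies the result into `out`.
void accumulate(int count, float step, vec4f &out)
{
    assert(count >= 0);
    vec4f data = {1.0f, 1.0f, 1.0f, 1.0f};
    vec4f const val = {step, step, step, step};
    for (int i = 0; i < count; ++i)
        addvec4f(data, val);
    for (int i = 0; i < 4; ++i)
        out[i] = data[i];
}
```

Calling `accumulate(10000000, 0.1f, result)` mirrors the loop in the question's `main`.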
    

BTW:

  • GCC also has flags that report whether a loop has been vectorized, such as -ftree-vectorizer-verbose=2. A higher number gives more output; currently the value can be 0, 1, or 2. Here is the documentation of this flag, and some other related flags.
  • Be careful with alignment. The address of the array should be aligned, and the compiler cannot always know at compile time whether it is. Usually, an aligned SSE access to misaligned data faults at run time (reported as a bus error or segmentation fault, depending on the platform). Here is the reason.
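
One way to hand the alignment information to the compiler is sketched below. It assumes GCC or Clang, since `__builtin_assume_aligned` and `__restrict` are compiler extensions, plus C++11 for `alignas`; the names `Vec4` and `add4` are hypothetical. Treat it as an illustration under those assumptions, and verify in the assembly that it helps on your compiler version.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical type: four floats with guaranteed 16-byte alignment.
struct alignas(16) Vec4
{
    float v[4];
};

// Adds b to a element-wise. __builtin_assume_aligned promises the compiler
// that both pointers are 16-byte aligned, and __restrict promises that they
// do not overlap. The asserts check the alignment promise in debug builds.
inline void add4(float *__restrict a_raw, float const *__restrict b_raw)
{
    assert(reinterpret_cast<std::uintptr_t>(a_raw) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b_raw) % 16 == 0);
    float *a = static_cast<float *>(__builtin_assume_aligned(a_raw, 16));
    float const *b =
        static_cast<float const *>(__builtin_assume_aligned(b_raw, 16));
    for (int i = 0; i < 4; ++i)
        a[i] += b[i];
}
```

Passing pointers that break either promise is undefined behaviour, so the aligned wrapper type and the pointer contract belong together.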
Kun Ling