How to avoid SSE pipeline flush?

Question

I've been running into a subtle issue with SSE. I want to optimise my ray tracer with SSE so that I can get a basic feel for how much performance SSE can buy.

I'd like to start with this very function.

Vector3f Add( const Vector3f& v0, const Vector3f& v1 );

(Actually I tried to optimise CrossProduct first; addition is shown here for simplicity, and I know it is not the bottleneck of my ray tracer.)

Here is a part of the definition of the struct:

struct Vector3f
{
    union { struct { float x; float y; float z; float reserved; }; __m128 data; };
    // ... (member functions omitted)
};

The issue is that with this declaration the SSE registers get flushed to memory: the compiler is not smart enough to keep the values in SSE registers for further use. The following declaration avoids the flushing.

__m128 Add( __m128 v0_data, __m128 v1_data );

I can go this way in this case, but it would be an ugly design for a Matrix, which holds four __m128 values. And you can't have operators that work on Vector3f itself, only on its data. :(

The most disturbing thing is that you have to change your higher-level code everywhere to adapt to the change. This way of optimising with SSE is definitely not an option for something large like a huge game engine: you'd have to change a huge amount of code before it works.

Unless the register flushing is avoided, I guess the benefit of SSE will be eaten up by those useless store/load instructions, which renders SSE useless.

@JerryCao1985 You can mention this directly in your question by editing it, instead of adding it as a comment. — Borgleader, Jul 09 '15 at 15:47
If you're serious about SIMD optimisation then you need to be prepared to completely re-factor your code so that you can process large chunks of data homogeneously entirely in SIMD. Trying to apply SIMD in an ad hoc fashion to an existing code base will usually be sub-optimal, as you are already seeing with the above example. — Paul R, Jul 09 '15 at 16:23
It definitely *is* an option for huge game engines, and yes the SIMD invades and tries to "taint" all your code. So be it. Just write your code with that in mind from the beginning and you'll be fine. — harold, Jul 09 '15 at 18:51
[fast-dot-product-using-sse-avx-intrinsics](https://stackoverflow.com/questions/30590487/fast-dot-product-using-sse-avx-intrinsics/30596772#30596772). — Z boson, Jul 10 '15 at 07:00
Are you saying that just passing around your data in a union type is preventing the compiler from keeping it in xmm registers? Or is it only storing to memory when you mix it with code that accesses the components one at a time? (And BTW, why the `struct` around the `union`? Why not just typedef your `Vector3f` as a union containing a struct and a `__m128`?) — Peter Cordes, Jul 11 '15 at 04:55
Passing the structure will ruin the optimization in Visual studio 2012. I also have some member functions for this struct not shown here, that's why I wrap it with a structure. It is also mentioned here: http://www.gamedev.net/page/resources/_/technical/game-programming/practical-cross-platform-simd-math-part-2-r3101 — JerryCao1985, Jul 12 '15 at 05:54

score 1 · Answer 1 · answered Sep 12 '15 at 11:53

It seems that a union is a bad thing to use here. As soon as a compiler sees __m128 in a union with something else, it has trouble working out when the values need to be updated, which leads to excessive memory operations.

MSVC is not the worst-performing compiler in this situation. Look at the code generated by GCC 5.1.0: on my machine it runs 12 times slower than the code generated by MSVC2013 (which spills registers), and more than 20 times slower than the optimal code.

It is interesting that most compilers start doing silly things only when you actually use the x, y, z members to access your data. For instance, MSVC2013 spills registers only when you read them via the scalar members after a computation (I guess to make sure these members are up to date). The terrible behavior of GCC seen above disappears if you set the initial values with _mm_setr_ps instead of writing them directly into the members.

It is better to avoid unions in this case. It seems that the OP has come to the same decision (see the current Vector3fv code). Making it harder to access a single coordinate also has a good "psychological" performance effect: a person will think twice before writing scalar code. You can easily write setters/getters either with extract/insert intrinsics (which makes the compiler generate those instructions), or with simple pointer arithmetic (which lets the compiler choose how to do it):

float getX() const { return ((float*)&data)[0]; }
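For comparison, here is a minimal sketch of the union-free layout with accessors written using intrinsics rather than pointer arithmetic. The type name Vec3 and its members are hypothetical (they are not the OP's Vector3fv code), and the shuffle-based getters are just one way to extract a lane with plain SSE; whether they beat the pointer version depends on the compiler.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_add_ps, _mm_shuffle_ps, _mm_cvtss_f32, ...

// Hypothetical union-free vector type: the only data member is the __m128.
struct Vec3 {
    __m128 data;  // lanes hold x, y, z and one unused float

    Vec3() : data(_mm_setzero_ps()) {}
    explicit Vec3(__m128 v) : data(v) {}
    // Build the register directly instead of writing scalar members.
    Vec3(float x, float y, float z) : data(_mm_setr_ps(x, y, z, 0.0f)) {}

    // Getters: broadcast the wanted lane to lane 0, then read lane 0.
    float getX() const { return _mm_cvtss_f32(data); }
    float getY() const {
        return _mm_cvtss_f32(_mm_shuffle_ps(data, data, _MM_SHUFFLE(1, 1, 1, 1)));
    }
    float getZ() const {
        return _mm_cvtss_f32(_mm_shuffle_ps(data, data, _MM_SHUFFLE(2, 2, 2, 2)));
    }

    // Setter for x: replace lane 0 and keep lanes 1..3 unchanged.
    void setX(float x) { data = _mm_move_ss(data, _mm_set_ss(x)); }
};

// Operators can still be defined on the wrapper type itself.
inline Vec3 operator+(const Vec3& a, const Vec3& b) {
    return Vec3(_mm_add_ps(a.data, b.data));
}
```

With this, higher-level code can keep writing a + b on the vector type, and scalar access goes through explicit calls such as getX() instead of aliased members.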

When I remove the union and simply use __m128, the generated code becomes better on all compilers. However, MSVC2013 still emits unnecessary moves: one useless register move per arithmetic operation. I suppose this is an inefficiency in the compiler's inlining algorithm. You can remove these moves in MSVC2013 by declaring all your functions as __vectorcall. Note that using this new calling convention also lets you avoid register spilling when your SIMD functions have not been inlined at all.
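A rough sketch of what the __vectorcall declarations could look like follows. The SIMD_CALL macro and the Vec3v type are made up for illustration. The macro expands to nothing on compilers other than MSVC, so the same source still builds there. Arguments are taken by value on the assumption that a struct containing only __m128 members can then be passed in registers under this convention.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// __vectorcall is an MSVC calling convention; SIMD_CALL is a hypothetical
// helper macro so that other compilers simply ignore it.
#if defined(_MSC_VER)
#  define SIMD_CALL __vectorcall
#else
#  define SIMD_CALL
#endif

// Hypothetical union-free vector type holding only the SSE register.
struct Vec3v {
    __m128 data;
};

// Pass and return by value so the values can travel in XMM registers
// even when the call is not inlined.
inline Vec3v SIMD_CALL Add(Vec3v a, Vec3v b) {
    Vec3v r = { _mm_add_ps(a.data, b.data) };
    return r;
}

inline Vec3v SIMD_CALL Mul(Vec3v a, Vec3v b) {
    Vec3v r = { _mm_mul_ps(a.data, b.data) };
    return r;
}
```

It is worth checking the generated assembly for your own compiler version, since the effect described above was observed on MSVC2013.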
