I've run into a subtle issue with SSE. I want to optimise my ray tracer with SSE, mainly to get a basic feel for how much SSE can improve performance.
I'd like to start with this function:
Vector3f Add( const Vector3f& v0, const Vector3f& v1 );
(Actually I tried to optimise CrossProduct first; Add is shown here for simplicity, and I know it is not the bottleneck of my ray tracer.)
Here is a part of the definition of the struct:
struct Vector3f
{
    union { struct { float x; float y; float z; float reserved; }; __m128 data; };
    // ...
};
The issue is that with this declaration the SSE registers get flushed: the compiler is not smart enough to keep the values in SSE registers for further use, so it writes them back to memory and reloads them. With the following declaration, the flushing is avoided:
__m128 Add( __m128 v0_data, __m128 v1_data );
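To make the comparison concrete, here is a minimal sketch of the two shapes side by side. The names Vec3Sketch, AddStruct, AddRaw and Lane are hypothetical and only for illustration; the struct holds just the __m128 member (the x/y/z union is left out to keep the sketch portable). Whether the struct version really spills to memory depends on the compiler, the optimisation level and whether the call gets inlined, so this shows the two interfaces, not a measurement.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_add_ps, _mm_set_ps, _mm_storeu_ps
#include <cassert>

// Hypothetical stand-in for Vector3f; only the SSE member is kept.
struct Vec3Sketch
{
    __m128 data;  // x, y, z, reserved packed into one 128-bit value
};

// Struct-based shape: the arguments are passed by reference, so the
// callee reads them through memory unless the call is inlined.
inline Vec3Sketch AddStruct(const Vec3Sketch& v0, const Vec3Sketch& v1)
{
    Vec3Sketch r;
    r.data = _mm_add_ps(v0.data, v1.data);  // lane-wise add of four floats
    return r;
}

// Raw shape: the arguments are passed by value as __m128, which the
// calling convention may be able to keep in SSE registers.
inline __m128 AddRaw(__m128 v0_data, __m128 v1_data)
{
    return _mm_add_ps(v0_data, v1_data);
}

// Helper (assumption, for inspection only): copy the four lanes out to
// an array and return lane i, where lane 0 is x.
inline float Lane(__m128 v, int i)
{
    float out[4];
    _mm_storeu_ps(out, v);
    return out[i];
}
```

Both functions compute the same result; the difference is only in how the operands travel across the call boundary. Comparing the generated assembly of the two (for example with and without inlining) is the way to see whether the flushing actually happens for a given compiler.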
I can go this way in this case, but it would be an ugly design for Matrix, which holds four __m128 values. And you can't have operators that work on Vector3f itself, only on its data. :(
The most disturbing thing is that you have to change the higher-level code everywhere to adapt to the change. That makes this way of optimising with SSE a non-option for something large like a game engine: you would have to change a huge amount of code before it works.
If the register flushing can't be avoided, I guess the benefit of SSE will be eaten up by those useless flush instructions, which would render SSE pointless.