1

At the moment I am accessing my float values via a union:

typedef union
{
  float v[4];
  __m128 m;
}SSEFloat;

But in this link I read that this costs performance. Is there a performance loss with GCC 4? Do the floats need to be aligned? In the union too? Or is it correct to set the values like this:

SSEFloat a;
float tmp = 10.0;
a.m = _mm_load1_ps( &tmp );

At the moment I also can't find the Intel SSE intrinsics documentation :( Is there a "small" list of things to know for speed optimization?

Roby
  • hmm.. now found it http://software.intel.com/sites/default/files/m/9/4/c/8/e/18072-347603.pdf :P – Roby Apr 15 '13 at 19:12

2 Answers

2

The compiler will guarantee that the code executes correctly, but it may sacrifice performance for correctness. Since the union only adds syntactic convenience for accessing the individual elements of a 4-item float vector, and the __m128 object is (conceptually, if not actually) sitting in a register, I recommend you just use the __m128 object directly and use the _mm_store_ps and _mm_load_ps family of APIs to move data in and out of the object.
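A minimal sketch of that recommendation, assuming GCC or Clang on x86 with SSE enabled. The names add4 and add4_demo are made up for illustration, and the `__attribute__((aligned(16)))` spelling is the GCC/Clang one:

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_load_ps, _mm_store_ps */

/* Adds two 4-float vectors. All three pointers must be 16-byte aligned,
   because _mm_load_ps and _mm_store_ps are the aligned variants. */
void add4(const float *a, const float *b, float *out)
{
    __m128 va  = _mm_load_ps(a);       /* memory -> SSE value */
    __m128 vb  = _mm_load_ps(b);
    __m128 sum = _mm_add_ps(va, vb);   /* no union involved */
    _mm_store_ps(out, sum);            /* SSE value -> memory */
}

/* Hypothetical demo: computes {1,2,3,4} + {10,20,30,40} and returns
   element 2 of the result. */
float add4_demo(void)
{
    float a[4]   __attribute__((aligned(16))) = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4]   __attribute__((aligned(16))) = {10.0f, 20.0f, 30.0f, 40.0f};
    float out[4] __attribute__((aligned(16)));

    add4(a, b, out);
    return out[2];
}
```

Individual floats are only touched before the load and after the store, so the compiler is free to keep the intermediate values in registers.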

Comments in the link you supplied suggest that the compiler can optimize poorly around the union, especially with __m128 members. If you want to be sure of this, you should run experiments both with and without the union. For high-resolution time measurement on Linux I recommend the pthread_getcpuclockid and clock_gettime APIs. Post your results if you can!
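One possible shape for such an experiment, as a rough sketch only. It assumes Linux with GCC or Clang, and it reads CLOCK_THREAD_CPUTIME_ID directly through clock_gettime as a simpler stand-in for looking the clock up with pthread_getcpuclockid. The workload and the names thread_cpu_seconds, accumulate_direct, accumulate_union and time_run are invented for illustration:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* expose clock_gettime under strict -std=c99 */
#endif
#include <time.h>       /* clock_gettime, CLOCK_THREAD_CPUTIME_ID */
#include <xmmintrin.h>  /* SSE intrinsics */

typedef union {
    float v[4];
    __m128 m;
} SSEFloat;

/* CPU time consumed so far by the calling thread, in seconds. */
double thread_cpu_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Variant without the union: adds 1.0f to every lane `iterations` times. */
float accumulate_direct(int iterations)
{
    __m128 acc = _mm_setzero_ps();
    const __m128 one = _mm_set1_ps(1.0f);
    float out[4];
    int i;

    for (i = 0; i < iterations; ++i)
        acc = _mm_add_ps(acc, one);
    _mm_storeu_ps(out, acc);   /* unaligned store, `out` has no alignment request */
    return out[0];
}

/* Same work, but going through the union and reading lane 0 via v[0]. */
float accumulate_union(int iterations)
{
    SSEFloat acc, one;
    int i;

    acc.m = _mm_setzero_ps();
    one.m = _mm_set1_ps(1.0f);
    for (i = 0; i < iterations; ++i)
        acc.m = _mm_add_ps(acc.m, one.m);
    return acc.v[0];
}

/* Runs fn(iterations), stores its result, and returns the CPU seconds used. */
double time_run(float (*fn)(int), int iterations, float *result)
{
    double start = thread_cpu_seconds();
    *result = fn(iterations);
    return thread_cpu_seconds() - start;
}
```

A toy loop like this may be optimized identically in both variants, so a meaningful comparison needs your real access pattern, many repetitions, and a look at the generated assembly. On older glibc versions clock_gettime may additionally require linking with -lrt.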

In general, for best performance, make things as simple for the compiler as possible. This means keeping high-performance types like __m128 out of complex structures like unions and instead just declaring them on the stack or in memory allocated expressly for them.

Randall Cook
  • `&v` will also kill performance, since it prevents the compiler from enregistering the value. Instead you should use `_mm_store_ps` and `_mm_load_ps` when treating the SSE tuple as an array. – Ben Voigt Apr 15 '13 at 22:31
  • @Ben Voigt. You should rather use _mm_set1_ps over _mm_load_ps, as _mm_set1_ps doesn't map one-to-one to an assembly instruction and can be optimized by the compiler. On the other hand, _mm_load_ps maps one-to-one to the MOVAPS instruction. – ZarakiKenpachi Apr 15 '13 at 22:51
  • I don't think we know enough about the OP's application to know which API to use. I revised my answer to point to the family of load and store APIs, the most appropriate of which should be selected. – Randall Cook Apr 15 '13 at 23:01
  • @Ben Voigt, do I need to align v? – Roby Apr 16 '13 at 05:19
  • @Roby: Depends on whether you use `_mm_load_ps` or `_mm_loadu_ps`. – Ben Voigt Apr 16 '13 at 06:03
  • @Ben Voigt, at the moment I'm using _mm_load_ps without any alignment, but it works fine. I haven't had any problems with this :-/ – Roby Apr 16 '13 at 07:12
  • @Roby: Even when you don't request alignment, you may get good alignment by accident. But don't rely on it. – Ben Voigt Apr 16 '13 at 12:46
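To illustrate the difference discussed in the comments above, here is a small sketch, assuming GCC or Clang on x86. The helper names is_aligned16 and sum4 and the buffer demo_buf are hypothetical. _mm_load_ps requires a 16-byte-aligned address, while _mm_loadu_ps accepts any address:

```c
#include <stdint.h>     /* uintptr_t */
#include <xmmintrin.h>  /* _mm_load_ps, _mm_loadu_ps, _mm_storeu_ps */

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Sums the four floats starting at p. Uses the aligned load only when the
   address allows it and falls back to the unaligned load otherwise. */
float sum4(const float *p)
{
    __m128 v = is_aligned16(p) ? _mm_load_ps(p) : _mm_loadu_ps(p);
    float out[4];

    _mm_storeu_ps(out, v);
    return out[0] + out[1] + out[2] + out[3];
}

/* Demo buffer with an explicit alignment request (GCC/Clang attribute syntax).
   demo_buf itself is aligned, demo_buf + 1 is off by 4 bytes. */
float demo_buf[8] __attribute__((aligned(16))) = {1, 2, 3, 4, 5, 6, 7, 8};
```

The runtime check is only there to make the rule visible. In real code you would normally either guarantee alignment and call _mm_load_ps, or call _mm_loadu_ps unconditionally.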
0

If you use the floats in the union, the compiler will probably emit non-SSE code for accessing them, which will be a performance hit. It really depends on how you use the object. You can add _MM_ALIGN16 (__declspec(align(16))) in front of a wrapper struct and override the new and delete operators (if you are coding in C++). Check this question: SSE, intrinsics, and alignment
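For plain C with GCC (as in the question), a rough equivalent could look like the sketch below. It swaps in `__attribute__((aligned(16)))` for the MSVC-style `__declspec(align(16))`, and `_mm_malloc`/`_mm_free` for overriding new and delete. The type Vec4 and the functions vec4_alloc, vec4_free and vec4_alloc_demo are made-up names:

```c
#include <stddef.h>     /* size_t */
#include <stdint.h>     /* uintptr_t */
#include <xmmintrin.h>  /* SSE intrinsics; with GCC and Clang this header also
                           declares _mm_malloc and _mm_free */

/* Hypothetical wrapper struct with an explicit 16-byte alignment request. */
typedef struct {
    float v[4];
} __attribute__((aligned(16))) Vec4;

/* Allocates `count` Vec4 objects on a 16-byte boundary; NULL on failure. */
Vec4 *vec4_alloc(size_t count)
{
    return (Vec4 *)_mm_malloc(count * sizeof(Vec4), 16);
}

/* Memory obtained from _mm_malloc must be released with _mm_free. */
void vec4_free(Vec4 *p)
{
    _mm_free(p);
}

/* Demo: fills the second element with 2.5f using an aligned store and
   returns one lane, or -1.0f if allocation or alignment went wrong. */
float vec4_alloc_demo(void)
{
    Vec4 *p = vec4_alloc(2);
    float result = -1.0f;

    if (p == NULL)
        return result;
    if (((uintptr_t)p & 15u) == 0) {
        _mm_store_ps(p[1].v, _mm_set1_ps(2.5f));
        result = p[1].v[3];
    }
    vec4_free(p);
    return result;
}
```

Because sizeof(Vec4) is a multiple of 16, every element of the allocated array stays aligned, not just the first one.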

Trax
  • Hey, so for speed it's better not to use a union? If I allocate an array of __m128, do I really need to align them? I thought these elements are already aligned? – Roby May 26 '13 at 18:58
  • Union is fine, but you will need to take a look at the generated code since the compiler may not be doing what you think. If you use an aligned member inside a struct, the struct inherits the alignment (check this: http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Type-Attributes.html) – Trax May 27 '13 at 06:41
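The alignment inheritance mentioned in the last comment can be checked with a few lines, assuming GCC or Clang on x86. `__alignof__` is a GCC extension, and the struct Tagged and the three helper functions are hypothetical names:

```c
#include <stddef.h>     /* size_t, offsetof */
#include <xmmintrin.h>  /* __m128 */

/* The union from the question. */
typedef union {
    float v[4];
    __m128 m;
} SSEFloat;

/* Hypothetical struct that embeds the union after a 1-byte member. */
typedef struct {
    char tag;
    SSEFloat value;
} Tagged;

/* Alignment the compiler assigns to the union (taken from its __m128 member). */
size_t ssefloat_alignment(void)
{
    return __alignof__(SSEFloat);
}

/* Alignment of the enclosing struct, inherited from its strictest member. */
size_t tagged_alignment(void)
{
    return __alignof__(Tagged);
}

/* Offset of `value` inside Tagged; padding is inserted after `tag`. */
size_t tagged_value_offset(void)
{
    return offsetof(Tagged, value);
}
```

With GCC on x86 all three functions are expected to return 16, which means stack and static objects of these types are aligned automatically. Memory from plain malloc is a separate matter, since malloc is not guaranteed to return 16-byte-aligned blocks on every platform.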