
Is it safe/possible/advisable to cast floats directly to __m128 if they are 16 byte aligned?

I noticed using _mm_load_ps and _mm_store_ps to "wrap" a raw array adds a significant overhead.

What are potential pitfalls I should be aware of?

EDIT :

There is actually no overhead in using the load and store instructions; I mixed up some numbers, which is why I thought I was getting better performance. Even though I was able to do some HORRENDOUS mangling with raw memory addresses in a __m128 instance, when I ran the test it took TWICE AS LONG to complete without the _mm_load_ps instruction, probably falling back to some fail-safe code path.
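For reference, the two approaches being compared can be sketched like this (illustrative helper names, not the original benchmark); on 16-byte-aligned data both boil down to an aligned load:

```cpp
#include <cassert>
#include <xmmintrin.h>

// Hypothetical helpers, not from the original test: both functions
// require p to be 16-byte aligned, and both amount to an aligned load.
inline __m128 load_explicit(const float* p)
{
    return _mm_load_ps(p);                       // explicit aligned load
}

inline __m128 load_via_cast(const float* p)
{
    return *reinterpret_cast<const __m128*>(p);  // dereference a casted pointer
}
```

With optimization enabled, a compiler will typically emit the same `movaps` for both.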

Cœur
dtech

5 Answers


What makes you think that _mm_load_ps and _mm_store_ps "add a significant overhead" ? This is the normal way to load/store float data to/from SSE registers assuming source/destination is memory (and any other method eventually boils down to this anyway).
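For concreteness, the "normal way" looks something like this (hypothetical function name; assumes both arrays are 16-byte aligned and the length is a multiple of 4):

```cpp
#include <xmmintrin.h>

// Hypothetical helper: adds b into a, four floats per iteration.
// Assumes both pointers are 16-byte aligned and n is a multiple of 4.
void add_arrays(float* a, const float* b, int n)
{
    for (int i = 0; i < n; i += 4)
    {
        __m128 va = _mm_load_ps(a + i);           // aligned load into an SSE register
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(a + i, _mm_add_ps(va, vb));  // aligned store back to memory
    }
}
```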

Paul R
    Because I actually profiled it. Adding the same-length arrays in scalar code takes 0.337 seconds, in SSE with the load and store functions it takes 0.244 seconds, and without any conversion (using an array of __m128's) the same operation takes 0.127 seconds - almost twice as fast! – dtech Aug 01 '12 at 13:14
  • Actually the numbers vary, but an array of __m128's is always significantly faster than using the load and store functions and a raw array of floats. 50% of the times it is over twice as fast, sometimes not that much. – dtech Aug 01 '12 at 13:23
    I think you're probably misinterpreting the results of your profiling. It sounds like you're comparing explicit loads/stores against compiler-generated loads/stores, but the same instructions are most likely being used "under the hood" - you're just seeing the effects of different instruction scheduling/loop unrolling/etc. It would be useful to see your code though to see what it is exactly that you're measuring. – Paul R Aug 01 '12 at 13:37
    Paul - you seem to be right; the lower time was actually due to a numbers mix-up on my part. Without the load and store functions the operation actually takes longer, but it still completes accurately, probably falling back to some fail-safe. – dtech Aug 01 '12 at 13:51

There are several ways to put float values into SSE registers; the following intrinsics can be used:

__m128 sseval;
float a, b, c, d;

sseval = _mm_set_ps(a, b, c, d);  // sets element 3 = a ... element 0 = d,
                                  // i.e. [ d, c, b, a ] in memory order
sseval = _mm_setr_ps(a, b, c, d); // "reversed" order: [ a, b, c, d ] in memory
sseval = _mm_load_ps(&a);         // ill-specified here - "a" is not a float[4] ...
                                  // same as _mm_setr_ps(a[0], a[1], a[2], a[3])
                                  // if you have an actual 16-byte-aligned array

sseval = _mm_set1_ps(a);          // make vector from [ a, a, a, a ]
sseval = _mm_load1_ps(&a);        // load from &a, replicate - same as previous

sseval = _mm_set_ss(a);           // make vector from [ a, 0, 0, 0 ]
sseval = _mm_load_ss(&a);         // load from &a, zero others - same as prev

The compiler will often create the same instructions no matter whether you state _mm_set_ss(val) or _mm_load_ss(&val) - try it and disassemble your code.

It can, in some cases, be advantageous to write _mm_set_ss(*valptr) instead of _mm_load_ss(valptr) ... depends on (the structure of) your code.
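Since the argument order of these intrinsics is easy to get backwards, here is a quick sanity check (illustrative sketch, not part of the answer):

```cpp
#include <cassert>
#include <xmmintrin.h>

// Verifies the ordering/zeroing semantics described above.
// Hypothetical helper name; uses a locally aligned buffer.
inline void check_orderings()
{
    alignas(16) float out[4];

    _mm_store_ps(out, _mm_set_ps(4.f, 3.f, 2.f, 1.f));   // highest arg -> element 3
    assert(out[0] == 1.f && out[3] == 4.f);

    _mm_store_ps(out, _mm_setr_ps(1.f, 2.f, 3.f, 4.f));  // memory order
    assert(out[0] == 1.f && out[3] == 4.f);

    float a = 5.f;
    _mm_store_ps(out, _mm_load_ss(&a));                   // [ a, 0, 0, 0 ]
    assert(out[0] == 5.f && out[1] == 0.f && out[3] == 0.f);
}
```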

FrankH.
  • Thanks, I may go for a similar implementation – dtech Aug 01 '12 at 19:42
    I believe the biggest reason for the large variety of intrinsics is a) so the programmer can choose to use constants directly instead of variables (like `__m128 s = _mm_set1_ps(M_PI);` instead of `float pi[4] = { M_PI, M_PI, M_PI, M_PI }; __m128 s = _mm_load_ps(pi);`), and b) to allow the compiler to optimize certain cases where data that is already available / previously loaded can be re-used instead of issuing another memory access. I tend to write the code "compact" and disassemble the result, to get an idea of whether it went right ... – FrankH. Aug 01 '12 at 20:15

Going by http://msdn.microsoft.com/en-us/library/ayeb3ayc.aspx, it's possible but not safe or recommended.

You should not access the __m128 fields directly.


And here's the reason why:

http://social.msdn.microsoft.com/Forums/en-US/vclanguage/thread/766c8ddc-2e83-46f0-b5a1-31acbb6ac2c5/

  1. Casting a float* to __m128 will not work. The C++ compiler converts an assignment to the __m128 type into an SSE instruction that loads 4 float numbers into an SSE register. Even assuming this cast compiled, it would not create working code, because no SSE load instruction would be generated.

A __m128 variable is not really a variable or an array; it is a placeholder for an SSE register, replaced by the C++ compiler with SSE assembly instructions. To understand this better, read the Intel Assembly Programming Reference.

JAB
    yeah, I kind of saw this, but without an explanation WHY I somehow feel there is little value. It is more like I want to know for the pitfalls of doing so, because I plan to :) – dtech Aug 01 '12 at 13:02
  • Hm, well, looking through, it seems `__m128` is defined with `__attribute__ ((vector_size (16)))` (see http://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html). I suppose a direct cast to `__m128` may not actually utilize the designated registers for such operations properly? – JAB Aug 01 '12 at 13:08
  • Sorry to bump - things seem to have changed: __m128 now actually is declared as a union with respective member-arrays. And casting a `float*` to a `__m128*` also is ok, as long as alignment requirements are met on the `float*`. (Edit: I'm on Windows, using VS2012) – St0fF Jul 13 '16 at 21:22
  • @St0fF Interesting. Perhaps you should turn that into an answer? – JAB Jul 14 '16 at 20:25
    The second part of this answer is bogus, unless MSVC is totally weird. Dereferencing a `__m128 *` is fine, and generates an aligned load/store. If that's not what you want, then don't do it. – Peter Cordes Aug 25 '16 at 09:32
  • @PeterCordes It wouldn't be the first time MSVC did something weird like that, though the answer seems to be outdated at this point anyway. – JAB Aug 25 '16 at 13:32

A few years have passed since the question was asked. To answer it, my experience shows:

YES

Casting a float* to a __m128* (and vice versa) with reinterpret_cast is fine as long as that float* is 16-byte aligned - example (in MSVC 2012):

__declspec( align( 16 ) ) float f[4];
return _mm_mul_ps( _mm_set_ps1( 1.f ), *reinterpret_cast<__m128*>( f ) );
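As a side note, the same thing can be written without the MSVC-specific __declspec by using C++11 alignas - a sketch with a hypothetical helper, not part of the answer:

```cpp
#include <xmmintrin.h>

// Portable C++11 variant of the snippet above: alignas(16) replaces the
// MSVC-specific __declspec(align(16)).  f must be 16-byte aligned.
inline __m128 scaled(float f[4])
{
    return _mm_mul_ps(_mm_set_ps1(2.f), *reinterpret_cast<__m128*>(f));
}
```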
St0fF
  • Was actually looking at the SIMD code of the glm math library, where reinterpret_cast is used, and wondered how valid such a technique could possibly be. – Michael IV Jul 17 '19 at 07:25

The obvious issue I can see is that you're then aliasing (referring to a memory location through more than one pointer type), which can confuse the optimiser. The typical issue with aliasing is that, since the optimiser doesn't observe you modifying a memory location through the original pointer, it considers it unchanged.

Since you're obviously not using the optimiser to its full extent (or you'd be willing to rely on it to emit the correct SSE instructions) you'll probably be OK.

The problem with using the intrinsics yourself is that they're designed to operate on SSE registers, and can't use the instruction variants that load from a memory location and process it in a single instruction.

ecatmur
  • `__m128` is allowed to alias other types, including `float` or `__m128d`. (This is [why gcc defines `__m128` as `may_alias`](http://stackoverflow.com/questions/39114159/how-do-you-load-store-from-to-an-array-of-doubles-with-gnu-c-vector-extensions?noredirect=1#comment65607228_39114159), so it compiles as expected even with the default strict-aliasing.) Most of the time compilers will fold load intrinsics into memory operands for ALU instructions, so your last paragraph doesn't really apply either (at least with modern optimizing compilers). – Peter Cordes Aug 25 '16 at 09:29