Is it safe/possible/advisable to cast floats directly to __m128 if they are 16-byte aligned?
I noticed that using _mm_load_ps and _mm_store_ps to "wrap" a raw array adds significant overhead.
What are potential pitfalls I should be aware of?
EDIT:
There is actually no overhead in using the load and store instructions; I got some numbers mixed up, which is why I thought I got better performance. Even though I was able to do some HORRENDOUS mangling with raw memory addresses in a __m128 instance, when I ran the test it took TWICE AS LONG to complete without the _mm_load_ps instruction, probably because it fell back to some fail-safe code path.
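For concreteness, here is a minimal sketch of the two variants being compared. It is not the original test code: the helper names add_one_load_store and add_one_cast are hypothetical, and it assumes an x86 target with SSE, a 16-byte-aligned buffer, and a length that is a multiple of 4. The cast variant also relies on the compiler treating __m128 as allowed to alias float, which GCC and Clang do; treat that as an assumption elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_load_ps, _mm_store_ps, _mm_add_ps */

/* Hypothetical helper: add 1.0f to every element using explicit load/store.
   Assumes data is 16-byte aligned and n is a multiple of 4. */
static void add_one_load_store(float *data, size_t n)
{
    assert((uintptr_t)data % 16 == 0);  /* aligned load/store requirement */
    assert(n % 4 == 0);
    const __m128 one = _mm_set1_ps(1.0f);
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(data + i);            /* aligned load */
        _mm_store_ps(data + i, _mm_add_ps(v, one));  /* aligned store */
    }
}

/* Hypothetical helper: same operation through a direct pointer cast.
   Same alignment and length assumptions as above. */
static void add_one_cast(float *data, size_t n)
{
    assert((uintptr_t)data % 16 == 0);  /* a misaligned pointer is not safe here */
    assert(n % 4 == 0);
    __m128 *vec = (__m128 *)data;       /* view the float array as __m128 blocks */
    const __m128 one = _mm_set1_ps(1.0f);
    for (size_t i = 0; i < n / 4; ++i) {
        vec[i] = _mm_add_ps(vec[i], one);
    }
}
```

With optimization enabled, a compiler will typically emit the same aligned move instructions for both versions, which would be consistent with the load and store intrinsics adding no overhead. The main pitfall in either form is alignment: both assume 16-byte-aligned data, and _mm_loadu_ps / _mm_storeu_ps are the intrinsics to reach for when that cannot be guaranteed.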