SSE alignment of 3D vector

Question

I wish to ensure SSE is used for arithmetic on my 3D (96 bit) float vectors. However, I have read conflicting views on just what is necessary.

Some articles/posts say I need to use a 4D vector and "ignore" the 4th element, some say I must decorate my class with things like __declspec(align(16)) and override the new operator, and some say the compiler is clever enough to align things for me (I really hope this is true!).

I am using the Eigen library, but find that the "unsupported" AlignedVector3 class isn't fit for purpose (e.g. division by zero errors when doing component-wise division, lpNorm function includes the dummy 4th element).

A lot of the articles I've read are several years old now, so I hold out hope that modern compilers/SSE versions/CPUs can just align the data for me, or work with data that is not 16-byte aligned. Any up-to-date knowledge on this will be much appreciated!

CPUs can't just go off and start aligning stuff on their own, they only do what the code tells them to do. Also, if possible, stop this idea and instead use SIMD across the separate coordinates so you don't have to waste the 4th lane (and in general almost everything works out better that way, SIMD vectors are not intended to be used as linalg vectors) — harold, May 22 '16 at 13:59
Thanks for the comment (don't understand why this question was downvoted...). Anyway, I'm not sure what you mean by "use SIMD across the separate coordinates" - do you mean bulk processing multiple 3D vectors (that would be cool as well if it's possible)? I've also just discovered the C++11 `alignas(16)` specifier. I added it to my generic-dimensional vector class and it didn't cause a crash - but that's no proof SIMD is being used, of course. — Dave, May 22 '16 at 19:57
Yes, bulk processing, maybe use 3 pointers (x,y,z) into a block you got with _aligned_malloc. Also, you can load/store unaligned if necessary. This whole declarative alignment deal doesn't work really well throughout C++ stuff, for example if you put that type in a container it will still break unless you use a custom allocator. — harold, May 22 '16 at 20:53

score 2 · Accepted Answer · edited May 23 '17 at 12:31

We use SIMD at work, so maybe I can give you some feedback on it. Alignment is something you have to take care of when dealing with SIMD; this is to ensure cache-line alignment. However, I am not sure whether misaligned data will still cause a crash or whether the CPU is able to cope anyway (as with misaligned scalar types: in the old days they caused crashes, whereas now the CPU handles them at a performance cost). Maybe you can look at "SSE, intrinsics, and alignment"; it seems to have good answers for the alignment part of the question.
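As far as the intrinsics themselves go, `_mm_load_ps` requires a 16-byte-aligned address, while `_mm_loadu_ps` accepts any address. Here is a minimal sketch of the unaligned variant, assuming the SSE header `<xmmintrin.h>`; `sum4_unaligned` is a hypothetical helper name made up for illustration:

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: sums four consecutive floats starting at p.
// p may have any alignment because the load is _mm_loadu_ps.
inline float sum4_unaligned(const float* p) {
    __m128 v = _mm_loadu_ps(p);   // unaligned load, no alignment requirement

    alignas(16) float out[4];     // 16-byte-aligned scratch buffer
    _mm_store_ps(out, v);         // aligned store is safe here

    return out[0] + out[1] + out[2] + out[3];
}
```

Swapping `_mm_loadu_ps` for `_mm_load_ps` is only safe when you can guarantee that `p` is a multiple of 16 bytes.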

As for using a vector that is physically 4D as a 3D vector, it's not really good practice, because you don't get the full performance of the SIMD instructions. A better fit is to use a Structure Of Arrays (SOA).

Note: I am assuming 128-bit SIMD registers, each holding 4 scalars (int or float).

For example, if you have 4 3D points (or vectors), following your approach you will have 4 4D vectors, ignoring the 4th component of each point. In total you end up with 4 * 4 = 16 lanes, of which only 12 hold useful values.

By using SOA, you will have 3 128-bit SIMD registers (12 values), and you will store your points in the following way:

r1: x x x x

r2: y y y y

r3: z z z z

This way you fill the SIMD registers entirely and get the maximum benefit from SIMD. The other thing is that many of the calculations you will have to make (for example, adding 2 groups of 4 vectors) take only 3 SIMD instructions. It's a bit tricky to use and understand, but once you do, the gain is great.
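To make the layout above concrete, here is a rough sketch, assuming SSE intrinsics from `<xmmintrin.h>`. The type `Vec3x4` and the functions `add` and `dot` are hypothetical names invented for this example, not part of Eigen or any other library:

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical SOA packet holding four 3D vectors:
// x[i], y[i], z[i] together form vector i.
// alignas(16) places each 4-float array on a 16-byte boundary.
struct alignas(16) Vec3x4 {
    float x[4];
    float y[4];
    float z[4];
};

// Adds four pairs of 3D vectors using three SIMD additions.
inline Vec3x4 add(const Vec3x4& a, const Vec3x4& b) {
    Vec3x4 r;
    _mm_store_ps(r.x, _mm_add_ps(_mm_load_ps(a.x), _mm_load_ps(b.x)));
    _mm_store_ps(r.y, _mm_add_ps(_mm_load_ps(a.y), _mm_load_ps(b.y)));
    _mm_store_ps(r.z, _mm_add_ps(_mm_load_ps(a.z), _mm_load_ps(b.z)));
    return r;
}

// Computes four dot products at once; lane i holds dot(a_i, b_i).
// No horizontal (within-register) operations are needed.
inline __m128 dot(const Vec3x4& a, const Vec3x4& b) {
    __m128 xx = _mm_mul_ps(_mm_load_ps(a.x), _mm_load_ps(b.x));
    __m128 yy = _mm_mul_ps(_mm_load_ps(a.y), _mm_load_ps(b.y));
    __m128 zz = _mm_mul_ps(_mm_load_ps(a.z), _mm_load_ps(b.z));
    return _mm_add_ps(_mm_add_ps(xx, yy), zz);
}
```

No lane is wasted on padding, and since every lane holds a real coordinate, component-wise operations never touch a dummy value.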

Of course, you won't be able to use it this way in all cases, so you will sometimes fall back to the original solution of ignoring the last value.
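For that fallback, a padded vector could look roughly like the sketch below, assuming C++11 `alignas` and SSE intrinsics. `Vec3Padded` and `add_padded` are hypothetical names used only for illustration:

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical padded 3D vector: v[0..2] are x, y, z and v[3] is an
// unused padding lane that the caller is assumed to keep at zero.
struct alignas(16) Vec3Padded {
    float v[4];
};

static_assert(alignof(Vec3Padded) == 16, "Vec3Padded must be 16-byte aligned");
static_assert(sizeof(Vec3Padded) == 16, "Vec3Padded must fill one SSE register");

// Adds two padded vectors with a single SIMD addition.
// The padding lane is added as well (0 + 0 stays 0).
inline Vec3Padded add_padded(const Vec3Padded& a, const Vec3Padded& b) {
    Vec3Padded r;
    _mm_store_ps(r.v, _mm_add_ps(_mm_load_ps(a.v), _mm_load_ps(b.v)));
    return r;
}
```

The padding lane takes part in every operation, so anything like component-wise division or a norm has to handle the 4th lane explicitly.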

16 **bytes**. Unaligned can crash if you use `_mm_load_ps` (even with AVX instead of SSE), but not if you use `_mm_loadu_ps`. — Peter Cordes, May 24 '16 at 00:44
Scalar unaligned has never faulted on x86. You might be remembering working with a different architecture in the past. For me, I remember having my code fault on the Solaris boxes with SPARC CPUs at school. — Peter Cordes, May 24 '16 at 00:45
Yes, even better an AoSoA. See http://compilers.cs.uni-saarland.de/papers/leissa_vecimp_tr.pdf — Z boson, May 26 '16 at 08:16
