Is this some kind of alignment problem?
Almost certainly.
C compilers assume that a __m128 object has 16-byte alignment, and use movaps to load/store it, or use it as a memory operand to other SSE instructions (like addps xmm0, [mem]). These uses will fault at runtime if the pointer doesn't have 16-byte alignment.
But you haven't told Python to allocate float Elements[4][4] with any kind of alignment guarantee, so passing pointers to C will give you invalid union objects that violate the requirement that the union is aligned enough for its most-aligned member.
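To confirm that misalignment really is the culprit, one option is to check the pointer on the C side before anything dereferences it as a __m128. This is only a minimal sketch: is_aligned_16 and check_mat4_pointer are hypothetical helper names, not part of your code or of any library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Hypothetical helper: true if p is a multiple of 16 bytes,
// which is what movaps and __m128 accesses require.
static inline bool is_aligned_16(const void *p) {
    return ((uintptr_t)p & 15u) == 0;
}

// Hypothetical guard: call at the top of any function that receives a
// matrix pointer from Python and treats it as 16-byte aligned.
// Compiles to nothing when NDEBUG is defined.
static inline void check_mat4_pointer(const void *p) {
    assert(is_aligned_16(p) && "matrix pointer is not 16-byte aligned");
}
```

If the assertion fires on pointers coming from Python, the crash is the alignment fault described above rather than a logic bug in the math.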
If you can't get Python to guarantee 16-byte alignment of your objects, then you will have to change your C to still work (slightly less efficiently). Compiling with AVX enabled (gcc -O3 -march=native on AVX CPUs) will allow the compiler to use unaligned 16-byte vectors as memory operands. But it still won't make a misaligned __m128 safe, because it will still store with vmovaps, not vmovups.
Modern hardware has efficient unaligned load support, but cache-line splits are still not ideal. Instruction count is also worse because, without AVX, the compiler has to use separate movups loads instead of folding the load into addps xmm0, [mem] for data that only needs to be loaded once.
In C, remove the __m128 member, and use _mm_loadu_ps() to do unaligned loads.
#include <immintrin.h>
#include <stddef.h>   // size_t

typedef struct my_mat4 { float Elements[4][4]; } my_mat4;

static inline
__m128 load_vec(const struct my_mat4 *m4, size_t idx) {
    return _mm_loadu_ps(&m4->Elements[idx][0]);   // unaligned load of one row
}
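Stores need the same treatment as loads, since assigning through a __m128 lvalue would otherwise use movaps. The following is a sketch under that assumption, repeating the struct so it is self-contained; store_vec and add_rows are hypothetical names chosen for illustration, and the bounds asserts are an addition of mine.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>

typedef struct my_mat4 { float Elements[4][4]; } my_mat4;

// Unaligned load of row idx.
static inline __m128 load_vec(const my_mat4 *m4, size_t idx) {
    assert(idx < 4);
    return _mm_loadu_ps(&m4->Elements[idx][0]);
}

// Hypothetical counterpart: unaligned store of v into row idx.
static inline void store_vec(my_mat4 *m4, size_t idx, __m128 v) {
    assert(idx < 4);
    _mm_storeu_ps(&m4->Elements[idx][0], v);
}

// Hypothetical example: row dst = row a + row b, with no 16-byte
// alignment requirement on the matrix.
static inline void add_rows(my_mat4 *m4, size_t dst, size_t a, size_t b) {
    store_vec(m4, dst, _mm_add_ps(load_vec(m4, a), load_vec(m4, b)));
}
```

With this pattern the matrix only needs to be aligned for float, which is what Python is already giving you.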
With GNU C: redefine your union with an unaligned version of __m128
It would be most efficient to get Python to align your objects, but if not, this will let you compile your existing code with only one change to your object:
__m128 is defined in terms of GNU C native vectors in xmmintrin.h#69. (Other compilers that support GNU extensions are compatible, at least clang is.)
typedef float __m128 __attribute__ ((__vector_size__ (16), __may_alias__));
The header already defines an unaligned __m128_u that also uses aligned(1). We can use aligned(4) to guarantee that it's at least aligned on a float boundary, in case that helps.
This Just Works because different alignment versions of the same vector type are freely convertible, so code passing it to intrinsics compiles without warnings (even at -Wall).
typedef float __attribute__((vector_size(16), aligned(4))) unaligned__m128;
// I left out may_alias, only matters if you're using unaligned__m128* to load from non-float data.
// Probably doesn't hurt code-gen if you aren't using unaligned__m128* at all, just objects
//#define __m128 unaligned__m128 // not needed
typedef union my_mat4 {
    float Elements[4][4];
    unaligned__m128 Rows[4];
} my_mat4;
Functions using this type compile just fine (gcc8.1 on the Godbolt compiler explorer). (You could also have written m4->Rows[1] + m4->Rows[2], even in C not C++, because GNU C native vectors map C operators to per-element operations.)
__m128 use_row(union my_mat4 *m4) {
    __m128 tmp = _mm_add_ps(m4->Rows[1], m4->Rows[2]);
    m4->Rows[3] = tmp;
    return tmp;
}
With just -O3 (no -march), we get
movups xmm0, XMMWORD PTR [rdi+32] # unaligned loads
movups xmm1, XMMWORD PTR [rdi+16]
addps xmm0, xmm1
movups XMMWORD PTR [rdi+48], xmm0 # unaligned store
ret
But with -mavx (enabled by -march=haswell, for example), we get
use_row(my_mat4*):
vmovups xmm1, XMMWORD PTR [rdi+32]
vaddps xmm0, xmm1, XMMWORD PTR [rdi+16] # unaligned memory source is ok for AVX
vmovups XMMWORD PTR [rdi+48], xmm0
ret
Of course you'd want these functions to inline; I only made them non-inline so I could look at how they compiled. (How to remove "noise" from GCC/clang assembly output?).
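The operator form mentioned earlier could look like the sketch below. It relies only on GNU C native vectors, so GCC or clang is assumed and no intrinsics header is needed. use_row_ops is a hypothetical name, and the static_assert is my addition to document the alignment the union ends up with.

```c
#include <assert.h>

// Same under-aligned vector type as above: 16 bytes wide, float alignment.
typedef float __attribute__((vector_size(16), aligned(4))) unaligned__m128;

typedef union my_mat4 {
    float Elements[4][4];
    unaligned__m128 Rows[4];
} my_mat4;

// The union now only requires float alignment, not 16 bytes.
static_assert(_Alignof(my_mat4) == 4, "my_mat4 should only need 4-byte alignment");

// Hypothetical operator-based variant of use_row: Rows[3] = Rows[1] + Rows[2].
// The compiler applies + to each of the four float elements.
static inline void use_row_ops(my_mat4 *m4) {
    m4->Rows[3] = m4->Rows[1] + m4->Rows[2];
}
```

On x86-64 this should compile to essentially the same movups/addps sequence as the intrinsics version, although I haven't listed the asm for it here.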
BTW, defining MAT4_MATH__USE_SSE can change the ABI if you ever use this union as a member of a wider struct. struct {int foo; my_mat4 m4; }; needs 12 bytes of padding if my_mat4 is aligned, or no padding otherwise.
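The padding difference can be checked at compile time. In this sketch, mat4_aligned and mat4_unaligned are hypothetical stand-ins for the two variants of my_mat4, with alignas(16) modelling the alignment a __m128 member would impose, and a 4-byte int is assumed.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

// Hypothetical stand-ins for my_mat4 with and without the __m128 member.
typedef union mat4_aligned   { alignas(16) float Elements[4][4]; } mat4_aligned;
typedef union mat4_unaligned { float Elements[4][4]; } mat4_unaligned;

struct outer_aligned   { int foo; mat4_aligned   m4; };
struct outer_unaligned { int foo; mat4_unaligned m4; };

// Aligned variant: 12 bytes of padding after the 4-byte int.
static_assert(offsetof(struct outer_aligned, m4) == 16, "expected 12 bytes of padding");
static_assert(sizeof(struct outer_aligned) == 80, "16 + 64 bytes");

// Unaligned variant: the matrix starts right after the int.
static_assert(offsetof(struct outer_unaligned, m4) == 4, "expected no padding");
static_assert(sizeof(struct outer_unaligned) == 68, "4 + 64 bytes");
```

Two translation units that disagree on the macro would therefore disagree on where m4 lives inside the outer struct.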
If you compile some C with and some C without the macro def, you could do something like this (if you've solved the problem of getting Python to align objects):
#include <stdalign.h>
// give the same alignment regardless of whether the macro is defined.
typedef union my_mat4
{
    alignas(16) float Elements[4][4];
#ifdef MAT4_MATH__USE_SSE
    __m128 Rows[4];
#endif
} my_mat4;
Or not if you don't want to guarantee alignment when the macro is undefined.
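If you do go the alignas(16) route, one possible way to make sure the objects really are aligned is to let C allocate them and hand the pointer to Python. This is only a sketch of that idea, assuming a C11 library that provides aligned_alloc; mat4_alloc and mat4_free are hypothetical names, and the Python side of the handoff is not shown.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>
#include <stdlib.h>
#ifdef MAT4_MATH__USE_SSE
#include <immintrin.h>
#endif

typedef union my_mat4 {
    alignas(16) float Elements[4][4];
#ifdef MAT4_MATH__USE_SSE
    __m128 Rows[4];
#endif
} my_mat4;

// Layout is the same whether or not the macro is defined.
static_assert(alignof(my_mat4) == 16, "my_mat4 must be 16-byte aligned");
static_assert(sizeof(my_mat4) == 64, "my_mat4 must be exactly 16 floats");

// Hypothetical allocator: returns 16-byte-aligned storage, or NULL on failure.
// sizeof(my_mat4) is a multiple of the alignment, as aligned_alloc expects.
static inline my_mat4 *mat4_alloc(void) {
    return aligned_alloc(alignof(my_mat4), sizeof(my_mat4));
}

// Hypothetical matching deallocator.
static inline void mat4_free(my_mat4 *m) {
    free(m);
}
```

Memory from mat4_alloc has to be released with mat4_free (plain free), not by whatever allocator Python uses for its own objects.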