3

I have following snippet of C code that I want wrap with Cython and use in my Python program:

typedef union my_mat4
{
    float Elements[4][4];
#ifdef MAT4_MATH__USE_SSE
    __m128 Rows[4];
#endif
} my_mat4;

static inline my_mat4 init_my_mat4(void)
{
    my_mat4 Result = {0};
    return (Result);
}

I wrap this code in mat4.pxd following way:

cdef extern from "mat4.h":
    ctypedef union my_mat4:
        float Elements[4][4]

    my_mat4 init_my_mat4()

cdef class MyClass:
    cdef my_mat4 m
    cdef object o

and the mat4.pyx code:

cdef class MyClass:
    def __cinit__(self):
        self.m = init_my_mat4()

When I use MyClass in Python

my_class = MyClass()

Python exits with message Process finished with exit code 139 (interrupted by signal 11: SIGSEGV).

However, when I disable MAT4_MATH__USE_SSE from the C union, program runs normally. Even when I remove cdef object o from MyClass, everything runs ok.

I searched extensively for a solution but couldn't find anything. Is this some kind of alignment problem?

Thanks, A.

EDIT 1:

The Python program crashes only when module decimal is imported and __m128 is compiled in Cython extension module, so:

import decimal
my_class = MyClass()

crashes the interpreter (I get splash icon on my Ubuntu program launcher)

EDIT 2:

As @DavidW in comments said, the import decimal is probably not significant in this case - for him the program segfaults without the import, runs with import.

EDIT 3:

As @PeterCordes wrote, almost certainly it's misalignment problem. The solution is to remove __m128 member from union and make proposed load_vec() or similar macro with _mm_loadu_ps.

EDIT 4:

Another solution is using unaligned__m128 defined as typedef float __attribute((vector_size(16), aligned(4))) unaligned__m128; instead of __m128 in union. Thanks @PeterCordes.

Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
  • 1
    And how do you build your extension? Do you pass the define MAT4... to all compilation units? – ead Jul 05 '18 at 09:47
  • Yes, this is my setup.py script: import os from distutils.core import setup from Cython.Build import cythonize os.environ['CFLAGS'] = '-std=c11 -DMAT4_MATH__USE_SSE=1' setup( name = 'mat4math', ext_modules = cythonize("game/math/*.pyx"), ) – Andrej Kesely Jul 05 '18 at 09:52
  • I tried also hardcode __m128 Rows[4] without #ifdef, the same result, program crashed. However, when only float Elements[4][4]; are present in the union (I removed the __m128 part), everything runs normally. – Andrej Kesely Jul 05 '18 at 09:56
  • Does the issue also happen if you make a simple equivalent C program? (i.e. is the issue with Cython/Python or the C union?) – DavidW Jul 05 '18 at 11:06
  • as @DavidW mentioned I suggest you to try with simple C program before integrating with python. If you face same issue then it might be related to memory alignment. – yadhu Jul 05 '18 at 13:18
  • @DavidW Yes, I've tried it barebone, here is the source code: https://pastebin.com/VwK4Gk2y. The problem seems **import decimal** in Python code. When I import decimal and the Cython extension module is compiled with __m128 intrinsics, Python crashes. I'm wondering, is it bug in Cython/Python or my side (most probably). What is import decimal doing in this case? – Andrej Kesely Jul 05 '18 at 13:36
  • I get a segmentation fault without `import decimal` but not with. I suspect something is always wrong, but whether you trigger it is luck depending on what is in memory. I don't think decimal is significant – DavidW Jul 05 '18 at 17:42
  • @DavidW Thanks for looking at it. Yes, could be that something is always wrong and in my case import decimal just triggers the segfault. I think this problem is above my knowledge now, I just removed __m128 from the union. But I don't know where to look for additional help. – Andrej Kesely Jul 05 '18 at 18:09

1 Answers1

2

Is this some kind of alignment problem?

Almost certainly.

C compilers assume that a __m128 object has 16-byte alignment, and use movaps to load/store it, or use it as a memory operand to other SSE instructions (like addps xmm0, [mem]). These uses will fault at runtime if the pointer doesn't have 16-byte alignment.

But you haven't told Python to allocate float Elements[4][4] with any kind of alignment guarantee, so passing pointers to C will give you invalid union objects that violate the requirement that the union is aligned enough for its most-aligned member.


If you can't get Python to guarantee 16-byte alignment of your objects, then you will have to change your C to still work (slightly less efficiently). Compiling with AVX enabled (gcc -O3 -march=native on AVX CPUs) will allow the compiler to use unaligned 16-byte vectors as memory operands. But it still won't make a misaligned __m128 safe, because it will still store with vmovaps not vmovups.

Modern hardware has efficient unaligned load support, but cache-line splits are still not ideal. Instruction-count is also worse because with AVX, the compiler will have to use separate movups loads instead of addps xmm0, [mem] for data that only needs to be loaded once.

In C, remove the __m128 member, and use _mm_loadu_ps() to do unaligned loads.

typedef struct my_mat4 { float Elements[4][4]; } my_mat4;

static inline
__m128 load_vec(const struct my_mat4 *m4, size_t idx) {
    _mm_loadu_ps(&m4->Elements[idx][0]);
}

With GNU C: redefine your union with an unaligned version of __m128

It would be most efficient to get Python to align your objects, but if not, this will let you compile your existing code with only one change to your object:

__m128 is defined in terms of GNU C native vectors in xmmintrin.h#69. (Other compilers that support GNU extensions are compatible, at least clang is.)

typedef float __m128 attribute ((vector_size (16), may_alias));

The header already defines an unaligned __m128_u that also uses aligned(1). We can use aligned(4) to guarantee that it's at least aligned on a float boundary, in case that helps.

This Just Works because different alignment versions of the same vector type are freely convertible, so code passing it to intrinsics compiles without warnings (even at -Wall).

typedef float __attribute((vector_size(16), aligned(4))) unaligned__m128;
// I left out may_alias, only matters if you're using unaligned__m128* to load from non-float data.
// Probably doesn't hurt code-gen if you aren't using unaligned__m128* at all, just objects

   //#define __m128 unaligned__m128   // not needed

typedef union my_mat4 {
     float Elements[4][4];
     unaligned__m128 Rows[4];
} my_mat4;

Functions using this type compile just fine (gcc8.1 on the Godbolt compiler explorer). (You could also have written m4->Rows[1] + m4->Rows[2], even in C not C++, because GNU C native vectors map C operators to per-element operations.

__m128 use_row(union my_mat4 *m4) {
    __m128 tmp = _mm_add_ps(m4->Rows[1], m4->Rows[2]);
    m4->Rows[3] = tmp;
    return tmp;
}

With just -O3 (no -march), we get

    movups  xmm0, XMMWORD PTR [rdi+32]    # unaligned loads
    movups  xmm1, XMMWORD PTR [rdi+16]
    addps   xmm0, xmm1
    movups  XMMWORD PTR [rdi+48], xmm0    # unaligned store
    ret

But with -mavx (enabled by -march=haswell, for example), we get

use_row(my_mat4*):
    vmovups xmm1, XMMWORD PTR [rdi+32]
    vaddps  xmm0, xmm1, XMMWORD PTR [rdi+16]    # unaligned memory source is ok for AVX
    vmovups XMMWORD PTR [rdi+48], xmm0
    ret

Of course you'd want these functions to inline, I only made them non-inline so I could look at how they compiled. (How to remove "noise" from GCC/clang assembly output?).


BTW, defining MAT4_MATH__USE_SSE can change the ABI if you ever use this union as a member of a wider struct. struct {int foo; my_mat4 m4; }; needs 12 bytes of padding if my_mat4 is aligned, or no padding otherwise.

If you compile some C with and some C without the macro def, you could do something like this (if you've solved the problem of getting Python to align objects):

#include <stdalign.h>

// give the same alignment regardless of whether the macro is defined.
typedef union my_mat4
{
    alignas(16) float Elements[4][4];
#ifdef MAT4_MATH__USE_SSE
    __m128 Rows[4];
#endif
} my_mat4;

Or not if you don't want to guarantee alignment when the macro is undefined.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • Yes, for sure it's alignment problem, the question is, how to make it play nicely with Cython? I made new Pastebin: https://pastebin.com/XrA3C1xH. Added alignas(16) to Elements and compiling with -O3 -march=native -falign-functions=16 flags on my AMD 2400g. The python script still crashes. Any idea how to do it without removing the __m128 member and using _mm_loadu_ps()? – Andrej Kesely Jul 05 '18 at 20:59
  • Or should I do the align from Cython? Something like: ctypedef union my_mat4: alignas(16) float Elements[4][4] ? – Andrej Kesely Jul 05 '18 at 21:06
  • @AndrejKesely: `-falign-functions` is code alignment, not data alignment. It's totally irrelevant. The last section of my answer is only useful if you can get Python to align the objects. If you can't change your code, you could maybe redefine `__m128` as an unaligned vector (https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attributes.html). Oh, **if Cython allows you to require alignment in a way that actually works, that would be perfect.** – Peter Cordes Jul 05 '18 at 21:09
  • I fear that Cython doesn't have feature to specify alignment on members of ctypedefs (somebody correct me if I'm wrong!) The solution you proposed, removing the __m128 member from union and make the inline load_vec function seems the best and fool-proof, albeit it's more work. I will accept your answer, thank you! – Andrej Kesely Jul 05 '18 at 21:30
  • @AndrejKesely: Updated my answer with a GNU C/C++ workaround that tells compilers about alignment. – Peter Cordes Jul 05 '18 at 21:41
  • 1
    Yes, I tried your solution with unaligned __m128 and it works! – Andrej Kesely Jul 05 '18 at 22:03