
I'm still very new to SIMD intrinsics, and this is my second attempt at using them through Cython. After some help from people on here (many thanks), I have this:

from libc.stdlib cimport malloc, free, calloc
cdef extern from "immintrin.h":  # in this example, we use AVX2
    ctypedef float  __m256
    __m256 _mm256_loadu_ps  (float *__P) nogil  
    __m256 _mm256_add_ps    (__m256 __A, __m256 __B) nogil
    __m256 _mm256_mul_ps    (__m256 __A, __m256 __B) nogil
    __m256 _mm256_fmadd_ps  (__m256 __A, __m256 __B, __m256 __C) nogil
    void   _mm256_storeu_ps (float *__P, __m256 __A) nogil  

cdef float [::1] Example_v2 (float *A, float *B, float *C, int n) :
    ### this example for A and B having more than 8 elements, possibly non-divisible by 8    
    cdef:
        __m256 mA, mB, mC
        float *out = <float*> malloc( n * sizeof(float)) 
        float [::1] out_mem = <float [:n]> out
        int i, j, m = <int>( (n-1) / 8) + 1

    with nogil:
        for i in range(m):
            j = 8*i
            mA = _mm256_loadu_ps( &A[j] )
            mB = _mm256_loadu_ps( &B[j] )
            mC = _mm256_loadu_ps( &C[j] )
            _mm256_storeu_ps( &out[j] , _mm256_fmadd_ps( mA, mB, mC ) )

    return out_mem

def run2(float [::1] A, float [::1] B, float [::1] C):
    return Example_v2( &A[0] , &B[0] , &C[0], A.shape[0] )

If n is divisible by 8, the NumPy A*B+C result and the Cython result are identical, and that's the ONLY good news.

If n is not divisible by 8, it runs only once, gives the correct answer, and then Spyder pops up the message below. A second run, without changing anything, either stalls/hangs or repeats what happened on the first run. I don't know how to fix it.

Another related question: is there a better way to write the section under the "with nogil"? NumPy is still faster for n = 400 when running the above in a loop (100,000 iterations). Sigh...

  File "C:\Users\beng_\Anaconda3\lib\site-packages\psutil\_pswindows.py", line 679, in wrapper
    return fun(self, *args, **kwargs)
  File "C:\Users\beng_\Anaconda3\lib\site-packages\psutil\_pswindows.py", line 933, in create_time
    user, system, created = cext.proc_times(self.pid)
ProcessLookupError: [Errno 3] No such process (originated from GetExitCodeProcess != STILL_ACTIVE)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\beng_\Anaconda3\lib\site-packages\psutil\__init__.py", line 373, in _init
    self.create_time()
  File "C:\Users\beng_\Anaconda3\lib\site-packages\psutil\__init__.py", line 723, in create_time
    self._create_time = self._proc.create_time()
  File "C:\Users\beng_\Anaconda3\lib\site-packages\psutil\_pswindows.py", line 681, in wrapper
    raise convert_oserror(err, pid=self.pid, name=self._name)
psutil.NoSuchProcess: psutil.NoSuchProcess process no longer exists (pid=9296)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\beng_\Anaconda3\lib\site-packages\qtconsole\manager.py", line 27, in poll
    super(QtKernelRestarter, self).poll()
  File "C:\Users\beng_\Anaconda3\lib\site-packages\jupyter_client\restarter.py", line 113, in poll
    self.kernel_manager.restart_kernel(now=True, newports=newports)
  File "C:\Users\beng_\Anaconda3\lib\site-packages\jupyter_client\manager.py", line 411, in restart_kernel
    self.shutdown_kernel(now=now, restart=True)
  File "C:\Users\beng_\Anaconda3\lib\site-packages\jupyter_client\manager.py", line 371, in shutdown_kernel
    self._kill_kernel()
  File "C:\Users\beng_\Anaconda3\lib\site-packages\spyder\plugins\ipythonconsole\utils\manager.py", line 78, in _kill_kernel
    self.kill_proc_tree(self.kernel.pid)
  File "C:\Users\beng_\Anaconda3\lib\site-packages\spyder\plugins\ipythonconsole\utils\manager.py", line 44, in kill_proc_tree
    parent = psutil.Process(pid)
  File "C:\Users\beng_\Anaconda3\lib\site-packages\psutil\__init__.py", line 346, in __init__
    self._init(pid)
  File "C:\Users\beng_\Anaconda3\lib\site-packages\psutil\__init__.py", line 386, in _init
    raise NoSuchProcess(pid, None, msg)
psutil.NoSuchProcess: psutil.NoSuchProcess no process found with pid 9296

Ong Beng Seong
  • How much faster is NumPy, and what compiler with what options did you use? (And on what CPU?) Does it help to use `aligned_alloc` to get memory that's at least 32 byte aligned? Probably not if you're bound on memory bandwidth. – Peter Cordes Apr 01 '20 at 17:20
  • Does this answer your question? [How dangerous is it to access an array out of bounds?](https://stackoverflow.com/questions/15646973/how-dangerous-is-it-to-access-an-array-out-of-bounds) – ead Apr 01 '20 at 17:37
  • Your m is just too big, when n is not a multiple of 8. – ead Apr 01 '20 at 17:38
  • @ead: when n is not a multiple of 8, say n=26, I can do 3 groups of 8, which resolves the first 24, but then I still have two more elements left to do, and that's the problem because I don't know how to handle them. What is best practice? Do I zero-pad to a multiple of 8? Exit the loop and go back to non-SIMD code for the additional elements? Or is there a better SIMD-style flow that handles such cases? I thought loadu and storeu handled this, but it seems not. – Ong Beng Seong Apr 01 '20 at 19:04
  • @PeterCordes : for n=400 and loop thru 100,000... it's 0.089 sec (numpy) vs 0.25 secs. i9-9900K (for now, once the flow is right and i can get real speed up, it will be tested on linux cluster). Anyway, the concern now is to get it to work on own desktop. Compiler options for msvc is /arch:AVX2 and /O2. I'll read up on the aligned_alloc and let you know. Thanks. – Ong Beng Seong Apr 01 '20 at 19:12
  • For the remainder you cannot use SIMD, only the usual floating-point operations. – ead Apr 01 '20 at 19:15
  • Btw, even if it is fun, a better approach would probably be to write this code in C and call it from Cython, passing pointers. You have to look at the generated C code, which often makes life harder for the optimizer than it looks from the Cython code. – ead Apr 01 '20 at 19:17
  • @ead: Yes, that is my other approach and wrap it thru cython. might have a lot less headache. I tried cython first hoping it can work because I was hoping to see more out of cython. – Ong Beng Seong Apr 01 '20 at 19:24
  • If you run this 100,000 times you are also going to end up with quite a bit of memory allocated as the memory is never freed; not sure what that does to your timings? – fuglede Dec 21 '22 at 15:27

1 Answer


You can only use SIMD intrinsic code safely on the first m bits of data, where m is the largest multiple of s that is no greater than the number of bits in your data, and s is the size of your SIMD data type (e.g. s = 256 bits for type __m256). So when performing SIMD operations on a float array (sizeof(float) = 4 bytes or 32 bits) using __m256, you can safely use SIMD only on the first 8 * (l // 8) floats (rounding down), where l is the number of floats in the array.

  • Option 1: The remaining data should be processed without SIMD (see the sketch after this list).
  • Option 2: If the instruction can be repeated without affecting the end result, you can repeat the final SIMD instruction once for the last s bits of valid data.
  • Option 3: You can copy the data into a new float array of a larger size l', where l' is a multiple of 8, and ignore the extraneous last l' - l floating-point results.
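
As a rough illustration of Option 1, here is a minimal Cython sketch (Example_v3 is just a placeholder name; it reuses the malloc import and the cdef extern block from the question, and, as in the question, out is never freed): the SIMD loop covers only the complete 8-float chunks, and a plain scalar loop finishes the tail.

cdef float [::1] Example_v3 (float *A, float *B, float *C, int n):
    cdef:
        __m256 mA, mB, mC
        float *out = <float*> malloc( n * sizeof(float) )
        float [::1] out_mem = <float [:n]> out
        int i, j, m = n // 8          # number of complete 8-float chunks

    with nogil:
        for i in range(m):
            j = 8*i
            mA = _mm256_loadu_ps( &A[j] )
            mB = _mm256_loadu_ps( &B[j] )
            mC = _mm256_loadu_ps( &C[j] )
            _mm256_storeu_ps( &out[j] , _mm256_fmadd_ps( mA, mB, mC ) )
        # scalar cleanup for the last n % 8 elements
        for j in range(8*m, n):
            out[j] = A[j] * B[j] + C[j]

    return out_mem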

The problem you are having with SIMD intrinsics is not necessarily a limitation of Cython. I imagine you would also get undefined behavior if you used SIMD intrinsics directly in C without addressing the memory-safety problem above.

You should also be aware of the effects of memory alignment, pointer aliasing, and whether or not the target system has these instructions.

  • Option 4: By providing the right compiler options (e.g. "/arch:AVX2" for MSVC and "-mavx -ftree-vectorize" for gcc), you can get compiler-provided autovectorization in Cython or C without writing any SIMD intrinsic code. You may have to make sure that the compiler knows that the pointers are not aliased. One way to do so is to copy the data to locally-declared float arrays first. Another way would be the careful use of the restrict qualifier from C to promise that the pointers will not alias. Cython does not (yet?) recognize the restrict qualifier.

With Cython, you can declare compiler options with a header comment on top of the .pyx or .py file. For example:

# distutils: extra_compile_args = -mavx -ftree-vectorize
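
For illustration, a minimal sketch of this route (the function name fma_loop and the exact directive values are only an example, not the question's code): the whole kernel becomes a plain typed loop with no intrinsics at all, and the C compiler is left to vectorize it.

# distutils: extra_compile_args = -mavx -ftree-vectorize
# cython: boundscheck=False, wraparound=False

def fma_loop(float [::1] A, float [::1] B, float [::1] C, float [::1] out):
    # A plain element-wise loop; with bounds checks off and the flags above
    # (or /arch:AVX2 /O2 with MSVC), the C compiler is free to auto-vectorize
    # it.  gcc/clang may emit a runtime overlap check, or you can rule out
    # aliasing by copying into locally-declared buffers first.
    cdef Py_ssize_t i, n = A.shape[0]
    for i in range(n):
        out[i] = A[i] * B[i] + C[i]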
  • Option 4: if your array is bigger than one vector, you can sometimes arrange to do the last partial chunk with an unaligned vector that *ends* at the end of the final element. For odd-sized arrays, this will partially overlap with the last vector in the main loop. This works for stuff like `c[i] = a[i]+b[i]` where the destination doesn't overlap with any inputs, so you're just doing the same computation in the overlapping elements. Or for things like `a[i] |= 1` where redoing it is fine, or if you can arrange to load that final vector *before* doing the loop (see the sketch after these comments). – Peter Cordes Apr 08 '21 at 01:37
  • Re: allowed to alias or not: Intel intrinsic types like `__m256` *are* allowed to alias anything else, even things other than `float`: [Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?](https://stackoverflow.com/q/52112605). Also related to handling the final (partial) vector: [Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all](https://stackoverflow.com/q/34306933) `vmaskmovps` is also an alternative to scalar cleanup, if you can't find a better way. – Peter Cordes Apr 08 '21 at 01:39
  • But yeah, *auto*-vectorization is more likely, or at least more efficient, when you can promise the compiler non-aliasing. Modern gcc/clang *will* often generate code to check for overlap when auto-vectorizing, otherwise falling back to a scalar loop. Overall good points here, +1. Also, you often want to use `-mtune=znver2` or `-mtune=haswell` in addition to `-mavx`. ([Why doesn't gcc resolve `_mm256_loadu_pd` as single vmovupd?](https://stackoverflow.com/q/52626726)). Or just use `-march=native` to build for your own machine and enable everything it supports, including FMA. – Peter Cordes Apr 08 '21 at 01:42
  • @PeterCordes Aliasing can affect how the code is run and give wrong results. But I don't want this answer to be a full-on tutorial on vectorization, so I only said that the coder should be aware of these effects. I will add your Option 4 to the answer. – Golden Rockefeller Apr 08 '21 at 02:04
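
To make the partial-overlap cleanup from the comments above concrete, here is a minimal Cython sketch, assuming n >= 8, that the output buffer does not alias the inputs, and the same malloc import and cdef extern block as in the question (Example_overlap is just a placeholder name):

cdef float [::1] Example_overlap (float *A, float *B, float *C, int n):
    cdef:
        __m256 mA, mB, mC
        float *out = <float*> malloc( n * sizeof(float) )
        float [::1] out_mem = <float [:n]> out
        int i, j

    with nogil:
        for i in range(n // 8):
            j = 8*i
            mA = _mm256_loadu_ps( &A[j] )
            mB = _mm256_loadu_ps( &B[j] )
            mC = _mm256_loadu_ps( &C[j] )
            _mm256_storeu_ps( &out[j] , _mm256_fmadd_ps( mA, mB, mC ) )
        if n % 8 != 0:
            # last full vector, ending exactly at element n-1; it overlaps the
            # main loop, but redoing those elements produces the same values
            j = n - 8
            mA = _mm256_loadu_ps( &A[j] )
            mB = _mm256_loadu_ps( &B[j] )
            mC = _mm256_loadu_ps( &C[j] )
            _mm256_storeu_ps( &out[j] , _mm256_fmadd_ps( mA, mB, mC ) )

    return out_mem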