Cython execution speed vs. MSVC and GCC versions

Question

Intro

I have a fairly simple Cython module - which I simplified even further for the specific tests I have carried on.

This module has only one class, which has only one interesting method (named run): this method accepts as an input a Fortran-ordered 2D NumPy array and two 1D NumPy arrays, and does some very, very simple things on those (see below for code).

For the sake of benchmarking, I have compiled the exact same module with MSVC, GCC 8, GCC 11, GCC 12 and GCC 13 on Windows 10 64 bit, Python 3.9.10 64 bit, Cython 0.29.32. All the GCC compilers I have obtained from the excellent Brecht Sanders GitHub page (https://github.com/brechtsanders).

Main Question

The overarching question of this very long post is: I am just curious to know if anyone has any explanation regarding why GCC12 and GCC13 are so much slower than GCC11 (which is the fastest of all). Looks like performances are going down at each release of GCC, rather than getting better...

Benchmarks

In the benchmarking, I simply vary the array dimensions of the 2D and 1D arrays (m and n) and the number on nonzero entries in the 2D and 1D arrays. I repeat the run method 20 times per compiler version, per set of m and n and nonzero entries.

Optimization settings I am using:

MVSC

MSVC_EXTRA_COMPILE_ARGS = ['/O2', '/GS-', '/fp:fast', '/Ob2', '/nologo', '/arch:AVX512', '/Ot', '/GL']

GCC

GCC_EXTRA_COMPILE_ARGS = ['-Ofast', '-funroll-loops', '-flto', '-ftree-vectorize', '-march=native', '-fno-asynchronous-unwind-tables']
GCC_EXTRA_LINK_ARGS = ['-flto'] + GCC_EXTRA_COMPILE_ARGS

What I am observing is the following:

MSVC is by far the slowest at executing the benchmark (why would that be on Windows?)
The progression GCC8 -> GCC11 is promising, as GCC11 is faster than GCC8
GCC12 and GCC13 are both significantly slower than GCC11, with GCC13 being the worst (twice as slow as GCC11 and much worse than GCC12)

Table of Results:

Runtimes are in milliseconds (ms)

Graph (NOTE: Logarithmic Y axis!!):

Runtimes are in milliseconds (ms)

Code

Cython file:

###############################################################################

import numpy as np
cimport numpy as np
import cython

from cython.view cimport array as cvarray

from libc.float cimport FLT_MAX

DTYPE_float = np.float32
ctypedef np.float32_t DTYPE_float_t

cdef float SMALL = 1e-10
cdef int MAXSIZE = 1000000

cdef extern from "math.h" nogil:
    cdef float fabsf(float x)

###############################################################################

@cython.final
cdef class CythonLoops:

    cdef int m, n
    cdef int [:] col_starts
    cdef int [:] row_indices
    cdef double [:] x

    def __cinit__(self):

        self.m = 0
        self.n = 0

        self.col_starts  = cvarray(shape=(MAXSIZE,), itemsize=sizeof(int), format='i')
        self.row_indices = cvarray(shape=(MAXSIZE,), itemsize=sizeof(int), format='i')
        self.x = cvarray(shape=(MAXSIZE,), itemsize=sizeof(double), format='d')

    @cython.boundscheck(False) # turn off bounds-checking for entire function
    @cython.wraparound(False)  # turn off negative index wrapping for entire function
    @cython.nonecheck(False)
    @cython.initializedcheck(False)
    @cython.cdivision(True)
    cpdef run(self, DTYPE_float_t[::1, :] matrix,
                    DTYPE_float_t[:]      ub_values,
                    DTYPE_float_t[:]      priority):

        cdef Py_ssize_t i, j, m, n
        cdef int nza, collen
        cdef double too_large, ok, obj
        cdef float ub, element

        cdef int [:] col_starts = self.col_starts
        cdef int [:] row_indices = self.row_indices
        cdef double [:] x = self.x

        m = matrix.shape[0]
        n = matrix.shape[1]

        self.m = m
        self.n = n

        nza = 0
        collen = 0
        for i in range(n):
            for j in range(m+1):
                if j == 0:
                    element = priority[i]
                else:
                    element = matrix[j-1, i]

                if fabsf(element) < SMALL:
                    continue

                if j == 0:
                    obj = <double>element
                    # Do action 1 with external library
                else:
                    collen = nza + 1
                    col_starts[collen] = i+1
                    row_indices[collen] = j
                    x[collen] = <double>element
                    nza += 1

            ub = ub_values[i]

            if ub > FLT_MAX:
                too_large = 0.0
                # Do action 2 with external library
            elif ub > SMALL:
                ok = <double>ub
                # Do action 3 with external library

        # Use x, row_indices and col_starts in the external library

Setup file:

I use the following to compile it:

python setup.py build_ext --inplace --compiler=mingw32 gcc13

Where the last argument is the compiler I want to test

#!/usr/bin/env python

from setuptools import setup
from setuptools import Extension

from Cython.Build import cythonize
from Cython.Distutils import build_ext

import numpy as np
import os
import shutil
import sys
import getpass


MODULE = 'loop_cython_%s'

GCC_EXTRA_COMPILE_ARGS = ['-Ofast', '-funroll-loops', '-flto', '-ftree-vectorize', '-march=native', '-fno-asynchronous-unwind-tables']
GCC_EXTRA_LINK_ARGS = ['-flto'] + GCC_EXTRA_COMPILE_ARGS

MSVC_EXTRA_COMPILE_ARGS = ['/O2', '/GS-', '/fp:fast', '/Ob2', '/nologo', '/arch:AVX512', '/Ot', '/GL']
MSVC_EXTRA_LINK_ARGS = MSVC_EXTRA_COMPILE_ARGS


def remove_builds(kind):

    for folder in ['build', 'bin']:
        if os.path.isdir(folder):
            if folder == 'bin':
                continue
            shutil.rmtree(folder, ignore_errors=True)

    if os.path.isfile(MODULE + '_%s.c'%kind):
        os.remove(MODULE + '_%s.c'%kind)


def setenv(extra_args, doset=True, path=None, kind='gcc8'):

    flags = ''
    if doset:
        flags = '  '.join(extra_args)

    for key in ['CFLAGS', 'FFLAGS', 'CPPFLAGS']:
        os.environ[key] = flags

    user = getpass.getuser()

    if doset:
        path = os.environ['PATH']
        if kind == 'gcc8':
            os.environ['PATH'] = r'C:\Users\%s\Tools\MinGW64_8.0\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\Users\%s\WinPython39\WPy64-39100\python-3.9.10.amd64;'%(user, user)
        elif kind == 'gcc11':
            os.environ['PATH'] = r'C:\Users\%s\Tools\MinGW64\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\Users\%s\WinPython39\WPy64-39100\python-3.9.10.amd64;'%(user, user)
        elif kind == 'gcc12':
            os.environ['PATH'] = r'C:\Users\%s\Tools\MinGW64_12.2.0\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\Users\%s\WinPython39\WPy64-39100\python-3.9.10.amd64;'%(user, user)
        elif kind == 'gcc13':
            os.environ['PATH'] = r'C:\Users\%s\Tools\MinGW64_13.2.0\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\Users\%s\WinPython39\WPy64-39100\python-3.9.10.amd64;'%(user, user)
        elif kind == 'msvc':
            os.environ['PATH'] = r'C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.35.32215\bin\Hostx64\x64;C:\WINDOWS\system32;C:\WINDOWS;C:\Users\J0514162\WinPython39\WPy64-39100\python-3.9.10.amd64;C:\Program Files (x86)\Windows Kits\10\bin\10.0.22000.0\x64'
            os.environ['LIB'] = r'C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.35.32215\lib\x64;C:\Program Files (x86)\Windows Kits\10\Lib\10.0.22000.0\um\x64;C:\Program Files (x86)\Windows Kits\10\Lib\10.0.22000.0\ucrt\x64'
            os.environ["DISTUTILS_USE_SDK"] = '1'
            os.environ["MSSdk"] = '1'

    else:
        os.environ['PATH'] = path

    return path

class CustomBuildExt(build_ext):
    def build_extensions(self):
        # Override the compiler executables. Importantly, this
        # removes the "default" compiler flags that would
        # otherwise get passed on to to the compiler, i.e.,
        # distutils.sysconfig.get_var("CFLAGS").
        self.compiler.set_executable("compiler_so", "gcc -mdll -O -Wall -DMS_WIN64")
        self.compiler.set_executable("compiler_cxx", "g++ -O -Wall -DMS_WIN64")
        self.compiler.set_executable("linker_so", "gcc -shared -static")
        self.compiler.dll_libraries = []
        build_ext.build_extensions(self)

if __name__ == '__main__':

    os.system('cls')

    kind = None
    for arg in sys.argv:
        if arg.strip() in ['gcc8', 'gcc11', 'gcc12', 'gcc13', 'msvc']:
            kind = arg
            sys.argv.remove(arg)
            break

    base_file = os.path.join(os.getcwd(), MODULE[0:-3])


    source = base_file + '.pyx'
    target = base_file + '_%s.pyx'%kind
    shutil.copyfile(source, target)

    if kind == 'msvc':
        extra_compile_args = MSVC_EXTRA_COMPILE_ARGS[:]
        extra_link_args = MSVC_EXTRA_LINK_ARGS[:] + ['/MANIFEST']
    else:
        extra_compile_args = GCC_EXTRA_COMPILE_ARGS[:]
        extra_link_args = GCC_EXTRA_LINK_ARGS[:]

    path = setenv(extra_compile_args, kind=kind)
    remove_builds(kind)

    define_macros = [('WIN32', 1)]

    nname = MODULE%kind

    include_dirs = [np.get_include()]
    if kind == 'msvc':
        include_dirs += [r'C:\Program Files (x86)\Windows Kits\10\Include\10.0.22000.0\ucrt',
                         r'C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.35.32215\include',
                         r'C:\Program Files (x86)\Windows Kits\10\Include\10.0.22000.0\shared']

    extensions = [
        Extension(nname, [nname + '.pyx'],
                  extra_compile_args=extra_compile_args,
                  extra_link_args=extra_link_args,
                  include_dirs=include_dirs,
                  define_macros=define_macros)]

    # build the core extension(s)
    setup_kwargs = {'ext_modules': cythonize(extensions,
                                             compiler_directives={'embedsignature'  : False,
                                                                  'boundscheck'     : False,
                                                                  'wraparound'      : False,
                                                                  'initializedcheck': False,
                                                                  'cdivision'       : True,
                                                                  'language_level'  : '3str',
                                                                  'nonecheck'       : False},
                                             force=True,
                                             cache=False,
                                             quiet=False)}

    if kind != 'msvc':
        setup_kwargs['cmdclass'] = {'build_ext': CustomBuildExt}

    setup(**setup_kwargs)

    setenv([], False, path)
    remove_builds(kind)

Test code:

import os
import numpy
import time

import loop_cython_msvc as msvc
import loop_cython_gcc8 as gcc8
import loop_cython_gcc11 as gcc11
import loop_cython_gcc12 as gcc12
import loop_cython_gcc13 as gcc13

              # M     N     NNZ(matrix) NNZ(priority) NNZ(ub)
DIMENSIONS = [(1661 , 2608 , 3560      , 375         , 2488 ),
              (2828 , 3512 , 4333      , 413         , 2973 ),
              (780  , 985  , 646       , 23          , 984  ),
              (799  , 1558 , 1883      , 301         , 1116 ),
              (399  , 540  , 388       , 44          , 517  ),
              (10545, 10486, 14799     , 1053        , 10041),
              (3369 , 3684 , 3684      , 256         , 3242 ),
              (2052 , 5513 , 4772      , 1269        , 3319 ),
              (224  , 628  , 1345      , 396         , 594  ),
              (553  , 1475 , 1315      , 231         , 705  )]


def RunTest():

    print('M      N      NNZ     MSVC     GCC 8    GCC 11   GCC 12   GCC 13')

    for m, n, nnz_mat, nnz_priority, nnz_ub in DIMENSIONS:
        print('%-6d %-6d %-8d'%(m, n, nnz_mat), end='')

        for solver, label in zip([msvc, gcc8, gcc11, gcc12, gcc13], ['MSVC', 'GCC 8', 'GCC 11', 'GCC 12', 'GCC 13']):

            numpy.random.seed(123456)
            size = m*n
            idxes = numpy.arange(size)
            matrix = numpy.zeros((size, ), dtype=numpy.float32)
            idx_mat = numpy.random.choice(idxes, nnz_mat)
            matrix[idx_mat] = numpy.random.uniform(0, 1000, size=(nnz_mat, ))

            matrix = numpy.asfortranarray(matrix.reshape((m, n)))

            idxes = numpy.arange(m)
            priority = numpy.zeros((m, ), dtype=numpy.float32)
            idx_pri = numpy.random.choice(idxes, nnz_priority)
            priority[idx_pri] = numpy.random.uniform(0, 1000, size=(nnz_priority, ))

            idxes = numpy.arange(n)
            ub_values = numpy.inf*numpy.ones((n, ), dtype=numpy.float32)
            idx_ub = numpy.random.choice(idxes, nnz_ub)
            ub_values[idx_ub] = numpy.random.uniform(0, 1000, size=(nnz_ub, ))

            solver = solver.CythonLoops()
            time_tot = []

            for i in range(20):

                start = time.perf_counter()
                solver.run(matrix, ub_values, priority)
                elapsed = time.perf_counter() - start
                time_tot.append(elapsed*1e3)

            print('%-8.4g'%numpy.mean(time_tot), end=' ')

        print()

if __name__ == '__main__':

    os.system('cls')
    RunTest()

EDIT

After @PeterCordes comments, I have changed the optimization flags to this:

MSVC_EXTRA_COMPILE_ARGS = ['/O2', '/GS-', '/fp:fast', '/Ob2', '/nologo', '/arch:AVX512', '/Ot', '/GL', '/QIntel-jcc-erratum']

GCC_EXTRA_COMPILE_ARGS = ['-Ofast', '-funroll-loops', '-flto', '-ftree-vectorize', '-march=native', '-fno-asynchronous-unwind-tables', '-Wa,-mbranches-within-32B-boundaries']

MSVC appears to be marginally faster than before (between 5% and 10%), but GCC12 and GCC13 are slower than before (between 3% and 20%). Below a graph with the results on the largest 2D matrix:

Note: "Current" means with the latest optimization flags suggested by @PeterCordes, "Previous" is the original set of flags.

*MSVC is by far the slowest at executing the benchmark (why would that be on Windows?)* - because MSVC is generally not good at optimizing, especially for AVX-512. https://www.agner.org/forum/viewtopic.php?f=1&t=88&sid=8b953990ae2c25c4211539d599b0071b What CPU are you using? Zen 4? Ice Lake? — Peter Cordes, Aug 03 '23 at 06:39
Is this code intentionally hard for compilers to optimize for the sake of experiment? `if j == 0:` inside the loop looks weird to me, instead of putting that logic outside the inner loop. (Duplicating the `if fabsf(element) < SMALL:` manually would be a small price to pay vs. complicating the control-flow for human readers and the compiler; you could maybe mislead the compiler into branching each iteration.) — Peter Cordes, Aug 03 '23 at 07:36
What's the actual work here? Packing non-small elements of the matrix into `x[]` and indices into other arrays? AVX-512 `vcompressps` can massively speed that up vs. AVX2 if the source data is contiguous, so definitely check if compilers managed to auto-vectorize it. (See [AVX2 what is the most efficient way to pack left based on a mask?](https://stackoverflow.com/a/36949618) for AVX-512 left-packing, and my other answer there for AVX2 + BMI2.) I assume `# Do action 1 with external library` doesn't take much time since your [mcve] doesn't show what it is. — Peter Cordes, Aug 03 '23 at 07:38
@PeterCordes, CPU is Skylake-avx512. The fact that my sample is reduced and doesn't show what to do with the external library is not fundamental. The original code is in Cython, which is basically Python with a few annotation to statically type some variables - I have no idea how to use AVX-specific instructions in there. My question on why GCC13 is so much slower than GCC 11 is a curiosity after my investigation - but for sure it will make me stay with GCC 11 for our applications — Infinity77, Aug 03 '23 at 09:00
Yes, Cython translates to C and compiles that. So the question is how good the asm is and why. Presumably the C has a similar structure to the Python loops, so making the Python logic more compiler-friendly would help. Anyway, I was curious whether GCC 12 and 13 were stumbling over the awkward `if (j==0)` inside the loop, and whether restructuring it would make them run as fast as GCC11. Is this code with the placeholder comments the actual [mcve] that you benchmarked, or did that run include calls that couldn't inline? That could make a huge different to auto-vectorization. — Peter Cordes, Aug 03 '23 at 09:05
Since you have a Skylake-X CPU, have you tried `-Wa,-mbranches-within-32B-boundaries`? Later GCC versions might just be getting unlucky with [How can I mitigate the impact of the Intel jcc erratum on gcc?](https://stackoverflow.com/q/61256646) when you don't turn on options to mitigate it. (It's unfortunately not part of `-march=skylake-avx512` or `-march=native` on Skylake-family CPUs). — Peter Cordes, Aug 03 '23 at 09:07
@PeterCordes, I have added the flags you suggested and re-run the benchmarks. I have added some edits to the original post. — Infinity77, Aug 03 '23 at 09:36
It is painful to get cython to emit code, which can be well optimized by compilers (in my experience). E.g. `DTYPE_float_t[:] ub_values` should probably be `DTYPE_float_t[::1]` otherwise compiler will not optimize for stepwidth=1. You should look at the generated C-code to see, if there are some possible pitfalls. — ead, Aug 03 '23 at 12:49
@ead I have changed that declaration for `ub_values` and `priority`. No changes in runtimes. I have run `cython - a cython_file.pyx` and the whole `run` method is white, meaning no Python interaction. The generated C code is 25,000 lines long, and basically impenetrable. — Infinity77, Aug 04 '23 at 06:11
I ported the code to Pythran, which is usually better at vectorising, and there is no meaningful difference on my machine (Linux, GCC12). — Davidmh, Aug 04 '23 at 16:30
@Infinity77 most of the 25,000 lines is aren't hugely relevant to what you're doing. What you can do usefully is search for the *cython* code that you're interested in (because it's all embedded as comments). That'll get you to the smaller section that actually represents your function. It may not help, but it's not as impenetrable as it first seems! — DavidW, Aug 04 '23 at 17:02

score 2 · Accepted Answer · edited Aug 05 '23 at 08:53

This is a partial answer providing the generated C code produced by Cython once it has been simplified a bit to be shorter, more human-readable and easy to compile without any Cython, Python, or NumPy dependencies (the transformations are not expected to drastically impact the timings). It also show the generated assembly code and few comments.

Here is the C code:

#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#include <float.h>

/* Remove external dependencies */
typedef int64_t Py_ssize_t;
typedef void memoryview_obj;
typedef void* VTable;

typedef struct {
  struct memoryview_obj *memview;
  char *data;
  Py_ssize_t shape[8];
  Py_ssize_t strides[8];
  Py_ssize_t suboffsets[8];
} memviewslice;

struct Self
{
    /* PyObject_HEAD */
    struct VTable* tab;
    int m;
    int n;
    memviewslice col_starts;
    memviewslice row_indices;
    memviewslice x;
};

static float test_SMALL = 1e-10;

extern void INC_MEMVIEW(memviewslice* slice, int count);
extern void XDEC_MEMVIEW(memviewslice* slice, int count);

void run(struct Self* self, memviewslice matrix, memviewslice ub_values, memviewslice priority)
{
    Py_ssize_t i;
    Py_ssize_t j;
    Py_ssize_t m;
    Py_ssize_t n;
    int nza;
    int collen;
    double too_large;
    double ok;
    double obj;
    float ub;
    float element;
    memviewslice col_starts = { 0, 0, { 0 }, { 0 }, { 0 } };
    memviewslice row_indices = { 0, 0, { 0 }, { 0 }, { 0 } };
    memviewslice x = { 0, 0, { 0 }, { 0 }, { 0 } };
    memviewslice t_1 = { 0, 0, { 0 }, { 0 }, { 0 } };
    memviewslice t_2 = { 0, 0, { 0 }, { 0 }, { 0 } };
    Py_ssize_t t_3;
    Py_ssize_t t_4;
    Py_ssize_t t_6;
    Py_ssize_t t_7;
    int t_9;
    Py_ssize_t t_10;
    Py_ssize_t t_11;

    t_1 = self->col_starts;
    INC_MEMVIEW(&t_1, 1);
    col_starts = t_1;
    t_1.memview = NULL;
    t_1.data = NULL;

    t_1 = self->row_indices;
    INC_MEMVIEW(&t_1, 1);
    row_indices = t_1;
    t_1.memview = NULL;
    t_1.data = NULL;

    t_2 = self->x;
    INC_MEMVIEW(&t_2, 1);
    x = t_2;
    t_2.memview = NULL;
    t_2.data = NULL;

    m = matrix.shape[0];
    n = matrix.shape[1];
    self->m = m;
    self->n = n;
    nza = 0;
    collen = 0;

    t_3 = n;
    t_4 = t_3;
    for (i = 0; i < t_4; i++)
    {
        t_6 = m + 1;
        t_7 = t_6;

        for (j = 0; j < t_7; j++)
        {
            t_9 = (j == 0) != 0;

            if (t_9)
            {
                t_10 = i;
                element = (*((float*) ( /* dim=0 */ (priority.data + t_10 * priority.strides[0]) )));

                goto L7;
            }

            {
                t_10 = j - 1;
                t_11 = i;
                element = (*((float*) ( /* dim=1 */ (( /* dim=0 */ ((char *) (((float*) matrix.data) + t_10)) ) + t_11 * matrix.strides[1]) )));
            }

            L7:;

            t_9 = (fabsf(element) < test_SMALL) != 0;
            if (t_9)
            {
                goto L5_continue;
            }

            t_9 = ((j == 0) != 0);
            if (t_9)
            {

                obj = (double)element;

                goto L9;
            }

            {
                collen = nza + 1;

                t_11 = collen;
                *((int *) ( /* dim=0 */ (col_starts.data + t_11 * col_starts.strides[0]) )) = i + 1;

                t_11 = collen;
                *((int *) ( /* dim=0 */ (row_indices.data + t_11 * row_indices.strides[0]) )) = j;

                t_11 = collen;
                *((double *) ( /* dim=0 */ (x.data + t_11 * x.strides[0]) )) = ((double)element);

                nza = nza + 1;
            }

            L9:;
            L5_continue:;
        }

        t_11 = i;
        ub = (*((float*) ( /* dim=0 */ (ub_values.data + t_11 * ub_values.strides[0]) )));

        t_9 = (ub > FLT_MAX) != 0;
        if (t_9)
        {
            too_large = 0.0;
            goto L10;
        }

        t_9 = (ub > test_SMALL) != 0;
        if (t_9)
        {
            ok = (double)ub;
        }

        L10:;
    }

    XDEC_MEMVIEW(&col_starts, 1);
    XDEC_MEMVIEW(&row_indices, 1);
    XDEC_MEMVIEW(&x, 1);
}

Here is the resulting main loop compiled in assembly using GCC 11.4 (with the compiler option -Ofast -funroll-loops -ftree-vectorize -march=skylake -fno-asynchronous-unwind-tables):

.L4:
        lea     rdi, [r12-1]
        xor     eax, eax
        and     edi, 3
        je      .L5
        vmovss  xmm4, DWORD PTR [rcx]
        mov     eax, 1
        vandps  xmm5, xmm4, xmm1
        vcomiss xmm9, xmm5
        ja      .L25
        inc     edx
        movsx   r14, edx
        mov     r15, r9
        imul    r15, r14
        vcvtss2sd       xmm6, xmm4, xmm4
        mov     DWORD PTR [r8+r15], esi
        mov     r15, r11
        imul    r15, r14
        imul    r14, rbp
        mov     DWORD PTR [r10+r15], 1
        vmovsd  QWORD PTR [rbx+r14], xmm6
.L25:
        cmp     rdi, 1
        je      .L5
        cmp     rdi, 2
        je      .L26
        inc     rax
        vmovss  xmm7, DWORD PTR [rcx-4+rax*4]
        vandps  xmm2, xmm7, xmm1
        vcomiss xmm9, xmm2
        ja      .L26
        inc     edx
        movsx   rdi, edx
        mov     r14, r9
        mov     r15, r11
        imul    r14, rdi
        imul    r15, rdi
        imul    rdi, rbp
        mov     DWORD PTR [r8+r14], esi
        vcvtss2sd       xmm3, xmm7, xmm7
        mov     DWORD PTR [r10+r15], eax
        vmovsd  QWORD PTR [rbx+rdi], xmm3
.L26:
        inc     rax
        vmovss  xmm6, DWORD PTR [rcx-4+rax*4]
        vandps  xmm8, xmm6, xmm1
        vcomiss xmm9, xmm8
        jbe     .L36
        jmp     .L5
.L7:
        vmovss  xmm10, DWORD PTR [rcx-4+rax*4]
        vandps  xmm11, xmm10, xmm1
        vcomiss xmm9, xmm11
        ja      .L24
        inc     edx
        movsx   rdi, edx
        mov     r14, r9
        mov     r15, r11
        imul    r14, rdi
        imul    r15, rdi
        imul    rdi, rbp
        mov     DWORD PTR [r8+r14], esi
        vcvtss2sd       xmm12, xmm10, xmm10
        mov     DWORD PTR [r10+r15], eax
        vmovsd  QWORD PTR [rbx+rdi], xmm12
.L24:
        lea     r14, [rax+1]
        vmovss  xmm13, DWORD PTR [rcx-4+r14*4]
        vandps  xmm14, xmm13, xmm1
        vcomiss xmm9, xmm14
        ja      .L27
        inc     edx
        movsx   rdi, edx
        mov     r15, r9
        imul    r15, rdi
        vcvtss2sd       xmm15, xmm13, xmm13
        mov     DWORD PTR [r8+r15], esi
        mov     r15, r11
        imul    r15, rdi
        imul    rdi, rbp
        mov     DWORD PTR [r10+r15], r14d
        vmovsd  QWORD PTR [rbx+rdi], xmm15
.L27:
        lea     r14, [rax+2]
        vmovss  xmm0, DWORD PTR [rcx-4+r14*4]
        vandps  xmm4, xmm0, xmm1
        vcomiss xmm9, xmm4
        ja      .L28
        inc     edx
        movsx   rdi, edx
        mov     r15, r9
        imul    r15, rdi
        vcvtss2sd       xmm5, xmm0, xmm0
        mov     DWORD PTR [r8+r15], esi
        mov     r15, r11
        imul    r15, rdi
        imul    rdi, rbp
        mov     DWORD PTR [r10+r15], r14d
        vmovsd  QWORD PTR [rbx+rdi], xmm5
.L28:
        add     rax, 3
        vmovss  xmm6, DWORD PTR [rcx-4+rax*4]
        vandps  xmm7, xmm6, xmm1
        vcomiss xmm9, xmm7
        ja      .L5
.L36:
        inc     edx
        movsx   rdi, edx
        mov     r14, r9
        mov     r15, r11
        imul    r14, rdi
        imul    r15, rdi
        imul    rdi, rbp
        mov     DWORD PTR [r8+r14], esi
        vcvtss2sd       xmm2, xmm6, xmm6
        mov     DWORD PTR [r10+r15], eax
        vmovsd  QWORD PTR [rbx+rdi], xmm2
.L5:
        inc     rax
        cmp     r12, rax
        jne     .L7
        lea     rax, [rsi+1]
        add     rcx, QWORD PTR [rsp+8]
        cmp     r13, rsi
        je      .L2
        mov     rsi, rax
        jmp     .L4
.L2:

Same with GCC 12.2:

.L4:
        lea     rdi, [r12-1]
        xor     eax, eax
        and     edi, 3
        je      .L5
        vmovss  xmm15, DWORD PTR [rcx]
        mov     eax, 1
        vandps  xmm4, xmm15, xmm1
        vcomiss xmm0, xmm4
        ja      .L27
        inc     edx
        movsx   r14, edx
        mov     r15, r9
        imul    r15, r14
        vcvtss2sd       xmm5, xmm15, xmm15
        mov     DWORD PTR [r8+r15], esi
        mov     r15, r11
        imul    r15, r14
        imul    r14, rbp
        mov     DWORD PTR [r10+r15], 1
        vmovsd  QWORD PTR [rbx+r14], xmm5
.L27:
        cmp     rdi, 1
        je      .L5
        cmp     rdi, 2
        je      .L28
        inc     rax
        vmovss  xmm6, DWORD PTR [rcx-4+rax*4]
        vandps  xmm7, xmm6, xmm1
        vcomiss xmm0, xmm7
        ja      .L28
        inc     edx
        movsx   rdi, edx
        mov     r14, r9
        mov     r15, r11
        imul    r14, rdi
        imul    r15, rdi
        imul    rdi, rbp
        mov     DWORD PTR [r8+r14], esi
        vcvtss2sd       xmm2, xmm6, xmm6
        mov     DWORD PTR [r10+r15], eax
        vmovsd  QWORD PTR [rbx+rdi], xmm2
.L28:
        inc     rax
        vmovss  xmm5, DWORD PTR [rcx-4+rax*4]
        vandps  xmm3, xmm5, xmm1
        vcomiss xmm0, xmm3
        jbe     .L39
        jmp     .L5
.L41:
        vmovss  xmm8, DWORD PTR [rcx-4+rax*4]
        vandps  xmm9, xmm8, xmm1
        vcomiss xmm0, xmm9
        ja      .L26
        inc     edx
        movsx   r15, edx
        mov     r14, r9
        mov     rdi, r11
        imul    r14, r15
        imul    rdi, r15
        imul    r15, rbp
        mov     DWORD PTR [r8+r14], esi
        vcvtss2sd       xmm10, xmm8, xmm8
        mov     DWORD PTR [r10+rdi], eax
        vmovsd  QWORD PTR [rbx+r15], xmm10
.L26:
        lea     r14, [rax+1]
        vmovss  xmm11, DWORD PTR [rcx-4+r14*4]
        vandps  xmm12, xmm11, xmm1
        vcomiss xmm0, xmm12
        ja      .L29
        inc     edx
        movsx   rdi, edx
        mov     r15, r9
        imul    r15, rdi
        vcvtss2sd       xmm13, xmm11, xmm11
        mov     DWORD PTR [r8+r15], esi
        mov     r15, r11
        imul    r15, rdi
        imul    rdi, rbp
        mov     DWORD PTR [r10+r15], r14d
        vmovsd  QWORD PTR [rbx+rdi], xmm13
.L29:
        lea     r14, [rax+2]
        vmovss  xmm14, DWORD PTR [rcx-4+r14*4]
        vandps  xmm15, xmm14, xmm1
        vcomiss xmm0, xmm15
        ja      .L30
        inc     edx
        movsx   rdi, edx
        mov     r15, r9
        imul    r15, rdi
        vcvtss2sd       xmm4, xmm14, xmm14
        mov     DWORD PTR [r8+r15], esi
        mov     r15, r11
        imul    r15, rdi
        imul    rdi, rbp
        mov     DWORD PTR [r10+r15], r14d
        vmovsd  QWORD PTR [rbx+rdi], xmm4
.L30:
        add     rax, 3
        vmovss  xmm5, DWORD PTR [rcx-4+rax*4]
        vandps  xmm6, xmm5, xmm1
        vcomiss xmm0, xmm6
        ja      .L5
.L39:
        inc     edx
        movsx   rdi, edx
        mov     r14, r9
        mov     r15, r11
        imul    r14, rdi
        imul    r15, rdi
        imul    rdi, rbp
        mov     DWORD PTR [r8+r14], esi
        vcvtss2sd       xmm7, xmm5, xmm5
        mov     DWORD PTR [r10+r15], eax
        vmovsd  QWORD PTR [rbx+rdi], xmm7
.L5:
        inc     rax
        cmp     r12, rax
        jne     .L41
        mov     rdi, QWORD PTR [rsp+8]
        lea     rax, [rsi+1]
        add     rcx, rdi
        cmp     rsi, r13
        je      .L2
        mov     rsi, rax
        jmp     .L4
.L2:

Here is the same with GCC 13.1 (GCC 13.2 is not available on Godbolt):

.L4:
        mov     QWORD PTR [rsp+16], r14
        xor     eax, eax
.L20:
        mov     r8, rax
        sub     r8, r9
        not     r8
        and     r8d, 3
        je      .L6
        inc     rax
        vmovss  xmm15, DWORD PTR [rcx-4+rax*4]
        vandps  xmm0, xmm15, xmm2
        vcomiss xmm1, xmm0
        ja      .L20
        inc     edx
        mov     r15, r10
        mov     rdi, QWORD PTR [rsp+24]
        vcvtss2sd       xmm4, xmm15, xmm15
        movsx   r14, edx
        imul    r15, r14
        mov     DWORD PTR [rdi+r15], esi
        mov     r15, rbx
        imul    r15, r14
        imul    r14, r13
        mov     DWORD PTR [r11+r15], eax
        vmovsd  QWORD PTR [r12+r14], xmm4
        cmp     r8, 1
        je      .L6
        cmp     r8, 2
        je      .L21
        inc     rax
        vmovss  xmm5, DWORD PTR [rcx-4+rax*4]
        vandps  xmm6, xmm5, xmm2
        vcomiss xmm1, xmm6
        ja      .L20
        inc     edx
        mov     r14, r10
        vcvtss2sd       xmm7, xmm5, xmm5
        movsx   r8, edx
        imul    r14, r8
        mov     DWORD PTR [rdi+r14], esi
        mov     rdi, rbx
        imul    rdi, r8
        imul    r8, r13
        mov     DWORD PTR [r11+rdi], eax
        vmovsd  QWORD PTR [r12+r8], xmm7
.L21:
        inc     rax
        vmovss  xmm5, DWORD PTR [rcx-4+rax*4]
        vandps  xmm3, xmm5, xmm2
        vcomiss xmm1, xmm3
        ja      .L20
        inc     edx
        mov     r14, r10
        mov     rdi, QWORD PTR [rsp+24]
        movsx   r8, edx
        imul    r14, r8
.L32:
        mov     r15, rbx
        mov     DWORD PTR [rdi+r14], esi
        vcvtss2sd       xmm8, xmm5, xmm5
        imul    r15, r8
        imul    r8, r13
        mov     DWORD PTR [r11+r15], eax
        vmovsd  QWORD PTR [r12+r8], xmm8
.L6:
        lea     r8, [rax+1]
        mov     rax, r8
        cmp     r9, r8
        je      .L7
        vmovss  xmm9, DWORD PTR [rcx-4+r8*4]
        vandps  xmm10, xmm9, xmm2
        vcomiss xmm1, xmm10
        ja      .L20
        inc     edx
        mov     r15, r10
        mov     rdi, QWORD PTR [rsp+24]
        vcvtss2sd       xmm11, xmm9, xmm9
        movsx   r14, edx
        mov     DWORD PTR [rsp+4], edx
        imul    r15, r14
        mov     DWORD PTR [rdi+r15], esi
        mov     r15, rbx
        imul    r15, r14
        imul    r14, r13
        mov     DWORD PTR [r11+r15], eax
        inc     rax
        vmovss  xmm12, DWORD PTR [rcx-4+rax*4]
        vmovsd  QWORD PTR [r12+r14], xmm11
        vandps  xmm13, xmm12, xmm2
        vcomiss xmm1, xmm13
        ja      .L20
        inc     edx
        mov     r15, r10
        vcvtss2sd       xmm14, xmm12, xmm12
        movsx   r14, edx
        imul    r15, r14
        mov     DWORD PTR [rdi+r15], esi
        mov     r15, rbx
        imul    r15, r14
        imul    r14, r13
        mov     DWORD PTR [r11+r15], eax
        lea     rax, [r8+2]
        vmovss  xmm15, DWORD PTR [rcx-4+rax*4]
        vmovsd  QWORD PTR [r12+r14], xmm14
        vandps  xmm0, xmm15, xmm2
        vcomiss xmm1, xmm0
        ja      .L20
        mov     edx, DWORD PTR [rsp+4]
        mov     r15, r10
        vcvtss2sd       xmm4, xmm15, xmm15
        add     edx, 2
        movsx   r14, edx
        imul    r15, r14
        mov     DWORD PTR [rdi+r15], esi
        mov     r15, rbx
        imul    r15, r14
        imul    r14, r13
        mov     DWORD PTR [r11+r15], eax
        lea     rax, [r8+3]
        vmovss  xmm5, DWORD PTR [rcx-4+rax*4]
        vmovsd  QWORD PTR [r12+r14], xmm4
        vandps  xmm6, xmm5, xmm2
        vcomiss xmm1, xmm6
        ja      .L20
        mov     edx, DWORD PTR [rsp+4]
        mov     r14, r10
        add     edx, 3
        movsx   r8, edx
        imul    r14, r8
        jmp     .L32
.L7:
        mov     rdi, QWORD PTR [rsp+8]
        mov     r14, QWORD PTR [rsp+16]
        lea     rax, [rsi+1]
        add     rcx, rdi
        cmp     r14, rsi
        je      .L2
        mov     rsi, rax
        jmp     .L4
.L2:

Discussion

First of all, we can see that the main loop is full of conditional branches mixed with goto. Such kind of code are generally not written by humans. Compiler can expect them to be rare and so optimize them less efficiently than ones without goto (since assertions are easier to compute). The thing is the j == 0 condition is not really useful here: it can be done only once before the j-based loop. GCC does not seem to see that (especially due to the complex control-flow, amplified by Cython goto instructions). On to of that, the compiler do not know if fabsf(element) < SMALL is often true or often false, or even just hard to predict at runtime. If the loop is easy to predict (the first two cases), then a good compiler can write a more efficient code than the one produced by GCC (possibly using SIMD instructions).

Actually, I am not even sure -funroll-loops is useful here because of the complex control-flow. It makes the hot-loop significantly bigger (i.e., more space in caches) and the processor migh take more time to predict the probability of the taken/not-taken branches. This is dependent of the target processor. Having smaller code is also good for profiling and optimizing your code: simpler code are easier to optimize.

To optimize the code, the compiler make assumptions on the cost of the loops and the probability of the branches to be taken. If can assume for example that the inner loop if far more expensive than the outer loop. Heuristics are used to optimize the code and their parameters are carefully tuned so to maximize the performance of benchmark suites (and avoid regressions in code like the Linux kernel for GCC). Thus, a side effect is that your code can be slower while most code can be faster using a newer version of GCC, or most compilers. This is also one reason the performance of a given compiler is not very stable from one version to another. For example, if a matrix has a lot of zeros, then fabsf(element) < SMALL will be often true, causing the outer-loop to be more expensive than expected. The thing is the outer loop might less well optimized in GCC 13.2 than GCC 12.2, resulting in a slower execution time with GCC 13.2. For example, GCC 13.2 starts the outer loop with:

        mov     r8, rax
        sub     r8, r9
        not     r8
        and     r8d, 3
        je      .L6

Note that modern processors, like Skylake, can execute multiple instructions in parallel per cycle. This sequence of instructions can only be executed sequentially because of the dependency chain on the r8 register. Meanwhile, GCC 12.2 generate this in the beginning of the outer loop:

        lea     rdi, [r12-1]
        xor     eax, eax
        and     edi, 3
        je      .L5

The dependency chain (on rdi/edi) is apparently significantly smaller in this case. In practice, the assembly code is pretty big, so I do not expect this to be the only issue. One need to track the actual path taken for the input matrices in all versions of GCC (very tedious), or just run the code and profile it with tools like VTune (the recommended solution in this case).

We can see that there is a lot of imul instructions. Such instructions are typically due to the strides of the array that are not known at compile-time. GCC does not generate code for the strides that are 1. You should really mention this information in the memory views using ::1 if possible (as mentioned by ead in the comments) since imul instructions tends to be a bit expensive in hot loops like this (not to mention they can also increase the size of the code which is already bigger due to the partial unrolling).

Thank you for the very extensive analysis, impressive. Even if it doesn’t fully answer my question, I’ll consider my curiosity satisfied as it appears the problem is much more complicated than I thought. Thank you for the insights! — Infinity77, Aug 04 '23 at 18:50
@Infinity77: The question you asked was why later GCC versions have worse performance on the messy C code created by Cython for your hard-to-optimize source, and/or why the asm it makes is slower on current x86. IDK how much detail you want to get into with GCC internals, but the explanation about spaghetti goto logic defeating some GCC optimizations makes sense. — Peter Cordes, Aug 05 '23 at 08:40
@Infinity77: There are various related questions which might be more useful in practice, like how to write Python source code that Cython can turn into C that compiles better. (e.g. hoist the `j==0` case out of the loop as Jerome and I suggested (in comments and this answer), and the explicit stride=1). IDK if you wrote it that way intentionally to make it hard for the compiler and demonstrate the GCC12/13 regressions? Either way, maybe you should report it as a GCC missed-optimization / regression on https://gcc.gnu.org/bugzilla/ — Peter Cordes, Aug 05 '23 at 08:41
@PeterCordes: I will definitely follow the suggestion of moving the j==0 branch outside the loop. The explicit stride=1 made no difference at all for any compiler. As for the purpose of the code, I didn’t write it like that to show any regression - it is part of a larger library that I use and I found out (with great disappointment) that using newer version of GCC to compile it have much worse performance. I’ll see if I can merge Jerome’s answer and my question to make a meaningful regression report to GCC, although I’m not sure my use case is that important… — Infinity77, Aug 05 '23 at 09:10