
I am using MingW64 to compile Cython for Python 3.7. The task is entirely an exercise in which I am calculating the Julia set for a grid of points. I am following along with a book, "High Performance Python", which walks through how to use Cython. I have previously used the -O flag to set optimizations for DSP hardware and gotten significant improvements by setting this level higher, i.e. -O3. This is not the case for the Cython code, though.

When doing the same thing with the Cython code, it steadily produces slower results as the optimization level is increased, which seems to make no sense. The timings I get are:
-O1 = 0.41s
-O2 = 0.46s
-O3 = 0.47s
-Ofast = 0.48s
-Os = 0.419s

Does anyone have an idea why higher optimization levels would produce slower execution here?

The Cython code is in the file cythonfn.pyx:

def calculate_z(maxiter, zs, cs):
    # add type primitives to improve execution time
    cdef unsigned int i, n
    cdef double complex z, c
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs
        # while n < maxiter and abs(z) < 2:
        while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:  # equivalent to abs(z) < 2, but avoids the sqrt; increases performance
            z = z * z + c
            n += 1
            
        output[i] = n
    return output
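The replacement bailout test in the loop above is mathematically equivalent to the commented-out `abs(z) < 2`, since |z| < 2 iff |z|² < 4 for a non-negative modulus. A quick pure-Python check (illustrative only, not part of the original code) confirms the two conditions agree:

```python
# Verify that abs(z) < 2 and z.real**2 + z.imag**2 < 4 give the
# same answer for a large sample of random points.
import random

random.seed(0)
for _ in range(10_000):
    z = complex(random.uniform(-3, 3), random.uniform(-3, 3))
    assert (abs(z) < 2) == (z.real * z.real + z.imag * z.imag < 4)
print("bailout tests agree")
```

The squared form is cheaper because `abs()` on a complex number requires a square root on every loop iteration.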

The setup script, setup.py, is:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(
    cmdclass={'build_ext': build_ext},
    ext_modules=[Extension("calculate", ["cythonfn.pyx"])]
)
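The question doesn't show how the -O flags were passed to the compiler. One common way (a sketch of the usual approach, not necessarily how it was done here) is via `extra_compile_args` on the `Extension`:

```python
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(
    cmdclass={'build_ext': build_ext},
    ext_modules=[Extension(
        "calculate",
        ["cythonfn.pyx"],
        extra_compile_args=["-O3"],  # appended after distutils' own default flags
    )]
)
```

With gcc, a later -O flag on the command line overrides an earlier one, so the flag appended here wins over whatever optimization level distutils injects by default.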

And the main code is

import calculate
import time

c_real,c_imag = -0.62772, -0.42193  # point for investigation
x1, x2, y1, y2 = -1.8, 1.8, -1.8, 1.8

def calc_julia(desired_width, max_iter):
    x_step = (float(x2 - x1) / float(desired_width))
    y_step = (float(y1 - y2) / float(desired_width))
    x = []
    y = []
    ycoord = y2
    while ycoord > y1:
        y.append(ycoord)
        ycoord += y_step
    xcoord = x1
    while xcoord < x2:
        x.append(xcoord)
        xcoord += x_step

    zs = []
    cs = complex(c_real, c_imag)
    for ycoord in y:
        for xcoord in x:
            zs.append(complex(xcoord, ycoord))
    st = time.time()
    output = calculate.calculate_z(max_iter, zs, cs)
    et = time.time()
    print(f"Took {et-st} seconds")

if __name__ == '__main__':
    calc_julia(desired_width=1000, max_iter=300)
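A single `time.time()` measurement is noisy, which matters when comparing levels that differ by hundredths of a second. A repeat-based harness like the sketch below reports a mean and standard deviation; it uses a small stand-in workload, since the compiled `calculate` module isn't assumed to be available:

```python
import statistics
import timeit

def workload():
    # stand-in for calculate.calculate_z(max_iter, zs, cs): iterate one point
    z, c, n = 0j, complex(-0.62772, -0.42193), 0
    while n < 300 and (z.real * z.real + z.imag * z.imag) < 4:
        z = z * z + c
        n += 1
    return n

# run the workload in 5 batches of 1000 calls and report mean +/- std dev
times = timeit.repeat(workload, number=1000, repeat=5)
mean, std = statistics.mean(times), statistics.stdev(times)
print(f"{mean:.4f}s +/- {std:.4f}s per 1000 calls")
```

`timeit.repeat` returns one total per batch, so the spread across batches gives a rough variance estimate for the comparison between -O levels.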

And yes, I am aware that there are a number of things that can be improved here; for example, cs could just be a scalar since the values are all the same. This is more of an exercise, but it led to surprising results for the optimization.

alwaysmvp45
    Most of the time will be spent in the python code (zs[i], cs[i]) so there is nothing to optimize for gcc. The run times are thus the same. Btw without knowing the variance of the run times we cannot state that a version is faster than the other. – ead Aug 07 '20 at 05:10
  • I agree that a fair amount of time is spent on the python operations, but there is no reason the time should increase at higher optimization levels. I am seeing the following results, including standard deviations: -O1 = 0.41s ± 0.004s, -O2 = 0.46s ± 0.008s, -O3 = 0.47s ± 0.005s, -Ofast = 0.48s ± 0.009s, -Os = 0.419s ± 0.004s. So clearly there is a pretty serious execution penalty from using optimizers that should provide faster execution speed, i.e. the run times are not the same. – alwaysmvp45 Aug 07 '20 at 17:01
  • Some other tips for such microbenchmarks. What happens if you increase the number of iterations 10 or even 100 fold? Do those differences persist? Also, I know you mentioned that this is test code and that there is plenty of room for optimizations, but using typed memoryviews rather than lists for zs, cs, and out should help dramatically. Otherwise you really are just testing the cpython list api and there is not much that cython is doing to speed your code. Cython's html annotation tool can help with identifying potentially slow parts. – CodeSurgeon Aug 08 '20 at 00:17
  • I have updated the main code to be the entire code, which was missing the import statements and the explicit call. This is a fully reproducible system, assuming you have MingW64, Cython, and Windows all set up properly. I am aware of the improvements available, I was easily able to reduce the execution time below 0.03s, the objective was understanding how high compiler optimizations can lead to slower execution. It still has a very real performance improvement, going from over 7.5s to 0.81s using only the Cython provided. – alwaysmvp45 Aug 08 '20 at 01:38
  • Because of caching effects, it's not unusual for code optimized for size to actually be faster than code optimized for (execution) speed. Compiler optimization is always a guess as to what will be faster and often those guesses are wrong. It's ultimately up to you to find the compiler options that produce the best results for your code on your machine, regardless of whether they are the ones that "should" provide faster speed. If `-O1` yields the best results then use it and be happy. – Nate Eldredge Aug 08 '20 at 01:48
  • Increasing the maxiter value by a factor of 10 or 100 still show this same trend, and scaled at about 15x slower for 10x increase in iterations, regardless of the optimization level. Changing the desired width may have more complex relationships, as it will quickly become a game of what data gets put into the cache. At width=1000, it will make up about 1/3 of my cache, and so this should not be overly prohibitive, but scaling that by 10x would. This is in reference to CodeSurgeon's comment – alwaysmvp45 Aug 08 '20 at 01:53
  • One thing you can try if you are curious: from either the gcc manual or the `-v -Q` flags, you can see exactly which optimization options are enabled at `-O1` vs `-O2`. You can then bisect those to see if there is one in particular that causes the performance hit. I would guess it might be inlining that does it. If you're more curious, diff the generated assembly between the two. – Nate Eldredge Aug 08 '20 at 02:07
  • FWIW: I was not able to reproduce your timings with gcc-5.4 and Linux: -O1: 633 ms ± 4.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) -O2: 604 ms ± 2.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) -O3: 628 ms ± 1.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each). -O2 might be somewhat faster, but it is not unheard of that -O2 can be faster than -O3. – ead Aug 08 '20 at 08:47
  • Interesting, those numbers are a fair bit slower than what I was seeing on Windows, but they are on the same order, so it can easily be chalked up to hardware, especially if you are on a laptop or have a smaller L2 + L3 cache, as that will be pretty close to the limits on some processors. I checked again today and got pretty much the same numbers (after recompiling the Cython). I definitely understand how -O2 is faster than -O3 in this instance based on the answer below; I am still unsure about -O1 being faster, but it might be more work than it's worth to solve. – alwaysmvp45 Aug 09 '20 at 04:30

1 Answer


It looks like this is at least partially explained here: gcc optimization flag -O3 makes code slower than -O2.

To summarize the results, -O3 uses a different comparison/branch technique than -O2, so in a task like this, where the comparison occurs many times, the choice of technique matters. Because the loop's branch is highly predictable, the -O3 code generation actually adds execution time.

This still does not explain the slowdown from -O1 to -O2, but the linked source is sufficient to describe the -O2 to -O3 case. Perhaps someone else knows the answer to the remaining part?

alwaysmvp45