
I am creating a cellular automaton, and I was doing a performance test on the code.

I did it like this:

from time import perf_counter_ns

import numpy

while True:
    time_1 = perf_counter_ns()
    numpy.fnc1()   # fnc1..fnc6 are placeholders for the actual NumPy steps
    time_2 = perf_counter_ns()
    numpy.fnc2()
    time_3 = perf_counter_ns()
    numpy.fnc3()
    time_4 = perf_counter_ns()
    numpy.fnc4()
    time_5 = perf_counter_ns()
    numpy.fnc5()
    time_6 = perf_counter_ns()
    numpy.fnc6()
    time_7 = perf_counter_ns()

    diff_1 = time_2 - time_1
    diff_2 = time_3 - time_2
    diff_3 = time_4 - time_3
    diff_4 = time_5 - time_4
    diff_5 = time_6 - time_5
    diff_6 = time_7 - time_6

    print(diff_1, diff_2, diff_3, diff_4, diff_5, diff_6)

However, I found that the runtimes are somewhat inconsistent, so I did some longer tests, running for about 8 hours.

It seems the performance is "jumping around": for an hour it runs quicker, then slower, and within that hour there are smaller sections where it is quicker, then slower...

What is even more disturbing, the lengths of the slower-quicker periods are growing over time. So I am pretty sure it is not a regular system process.

The tests were conducted on Debian 11, on an AMD Ryzen 5 2600 six-core processor.

The GUI was running, but no browser, etc. I monitored the overall processor usage, and most of the cores were doing nothing.
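For reference, per-core usage can also be watched with a short Python script; this is just a minimal sketch using psutil, not necessarily how the usage was monitored here:

import psutil

# Print per-core utilisation once per second (Ctrl+C to stop).
while True:
    print(psutil.cpu_percent(interval=1.0, percpu=True))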

Note that it is a cellular automaton, and it evolved during the test, so the input data is not the same; it changes constantly.

However, I don't think that the time it takes to add up arrays would differ based on the numbers in the arrays...

Also, if the performance change were data-driven, it would be very unlikely for random data to produce about the same runtimes for a whole minute...
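That assumption (that adding arrays takes the same time regardless of the values) could be sanity-checked with something like this minimal sketch; the shape is made up, not the actual automaton data:

from time import perf_counter_ns

import numpy as np

def avg_add_time(a, b, repeats=100):
    # Average time of a single a + b, in nanoseconds.
    start = perf_counter_ns()
    for _ in range(repeats):
        a + b
    return (perf_counter_ns() - start) / repeats

shape = (2000, 2000)  # hypothetical grid size
zeros = np.zeros(shape)
rand = np.random.random(shape)

print("zeros + zeros:", avg_add_time(zeros, zeros), "ns")
print("rand + rand:  ", avg_add_time(rand, rand), "ns")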

Question: What am I seeing? What causes it?

My hunch is that maybe it has something to do with Python thread scheduling, but I don't know whether NumPy uses threads, and I only have one thread in my own code...
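One way to check the threading question is the following minimal sketch, which reports whether NumPy's BLAS backend keeps a thread pool (threadpoolctl is a separate package, not something used in the test above):

import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits

# List any BLAS/OpenMP thread pools loaded by NumPy (OpenBLAS, MKL, ...).
for pool in threadpool_info():
    print(pool["internal_api"], "threads:", pool["num_threads"])

# Force single-threaded BLAS for a block of code, to rule threading out.
with threadpool_limits(limits=1):
    a = np.random.random((1000, 1000))
    b = a @ a  # matrix multiply is where BLAS threading would show up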

There is at least a 2x performance difference between the "slower" and the "quicker" state on average, so it would be very beneficial to make the code stick in the "quicker" state.

I included 2 images:

One of the images shows the value of diff_3, cycle after cycle, zoomed in progressively.

The other image shows diff_1, diff_2, ... diff_6, all in one image. At this scale it is difficult to see the details, but diff_3 and diff_5 are somewhat comparable. As you can see, the "quicker" and "slower" periods match up, but not exactly.

The images cover about 6.5 million cycles.

Runtime of a single function

Runtime of each function

Zoltan K.
  • How are your temperatures? Googling "ryzen 2600 thermal throttling" suggests that quite a few people experience throttling with this CPU – slothrop May 06 '23 at 09:46
  • There are many possible reasons for this. More profiling information is needed so as not to make wild guesses. To discard the scheduling issues, you can [bind threads to cores](https://stackoverflow.com/a/72330218/12939557). For the frequency scaling, you need to tweak the governor and set the frequency (a low one), and disable any turbo-like mode. You can use `perf stat` to analyse the frequency, NUMA effects, thread migration, etc. This requires root privileges (at least to tweak the paranoid file). – Jérôme Richard May 06 '23 at 13:01
    "*but I don't know, if numpy uses threads*" No, Numpy does not use more than one thread except for linear algebra due to BLAS libraries (used by Numpy) being often multi-threaded (OpenBLAS by default on most platforms). You can tweak this if needed (with `OMP_NUM_THREADS` for OpenBLAS). – Jérôme Richard May 06 '23 at 13:02
  • I have been testing the thermal throttling theory. It seems to be a very likely culprit. Unfortunately, I have only just seen Jérôme Richard's comment, so I only just started logging all the CPU freqs along with the other data. The distribution of the freqs of individual cores over time is very similar to what I see in the execution time data. Now I am redoing that test, but I also bind the process to a single core, to better see if they align. – Zoltan K. May 06 '23 at 14:13

1 Answer


I think @slothrop's comment was spot on.

I did a longer test, measured the core frequency and the temperature, compared them to the execution speed, and also plotted a corrected execution speed.
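For reference, a minimal sketch of how core frequency and temperature can be sampled from Python on Linux (using psutil and os.sched_setaffinity; this is not necessarily the exact logging used for the plots below, and 'k10temp' is just the usual sensor name on Ryzen):

import os
import psutil

# Pin the process to one core so that core's frequency/temperature is the
# relevant one (Linux-only).
os.sched_setaffinity(0, {0})

def sample():
    freqs = psutil.cpu_freq(percpu=True)    # per-core frequency in MHz
    temps = psutil.sensors_temperatures()   # sensor readings, e.g. 'k10temp' on Ryzen
    core0_mhz = freqs[0].current
    cpu_temp = temps["k10temp"][0].current if "k10temp" in temps else None
    return core0_mhz, cpu_temp

print(sample())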

I have also realized why the period of the thermal cycling was growing: I collected the data in memory and periodically dumped all of it (not just appending the new part). As more data was collected, the save time grew as well, giving the processor more time to cool.
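A minimal sketch of the append-only alternative (the file name and chunk size are arbitrary placeholders):

buffer = []

def record(diffs):
    buffer.append(diffs)
    if len(buffer) >= 10_000:                  # flush in constant-sized chunks
        with open("timings.csv", "a") as f:    # append instead of rewriting everything
            for row in buffer:
                f.write(",".join(str(v) for v in row) + "\n")
        buffer.clear()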

On the images:

CPU temperature vs loop execution time.

Raw loop execution time vs frequency-corrected execution time, both with a 100-sample moving average. For easier comparison, the common frequency used for the corrected data series was the average core frequency.

temperature vs loop exec time

raw loop exec time vs corrected for average core freq
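The correction itself is simple; a minimal sketch of it, assuming the per-cycle times and per-cycle core frequencies are already in NumPy arrays:

import numpy as np

def corrected_times(raw_ns, core_mhz, window=100):
    # Scale each cycle time to the average core frequency, then smooth
    # with a moving average of the given window length.
    raw_ns = np.asarray(raw_ns, dtype=float)
    core_mhz = np.asarray(core_mhz, dtype=float)
    avg_mhz = core_mhz.mean()
    corrected = raw_ns * core_mhz / avg_mhz   # time the cycle would take at avg_mhz
    kernel = np.ones(window) / window
    return np.convolve(corrected, kernel, mode="valid")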

Zoltan K.
  • I will wait a bit, in case someone wants to add something, or the guys helping out want to post an answer. Then I will accept my own answer. – Zoltan K. May 07 '23 at 12:26