
I have been experimenting with the performance of operations in NumPy, and it turns out that performing an element-wise operation (for example np.multiply) on matrices with inconsistent memory layouts (one with order 'C' and the other with order 'F') is around 2 times slower than the analogous operation on matrices with matching layouts. This is only to be expected, since memory locality has to be taken into account.

What surprises me is that when those matrices have dimensions that are multiples of a large power of two, the penalty for operating on memory-inconsistent data is much larger (4 or more times slower than the memory-matching counterpart).

To emphasize the phenomenon: element-wise multiplication of memory-inconsistent matrices of size 1024x1024 is about as costly as for matrices of size 1800x1800.
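For reference, the layout of each operand can be confirmed via the array's .flags attribute (a minimal check, independent of the benchmark below):

```python
import numpy as np

a = np.ones((1024, 1024), order='C')  # row-major
b = np.ones((1024, 1024), order='F')  # column-major

# For a 2-D array with more than one row/column, the two layouts are exclusive
print(a.flags['C_CONTIGUOUS'])  # True
print(b.flags['F_CONTIGUOUS'])  # True
print(b.flags['C_CONTIGUOUS'])  # False
```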

My setup (result of running lshw -short):

H/W path                 Device     Class          Description
==============================================================
                                    system         20KN001QPB (LENOVO_MT_20KN_BU_Think_FM_ThinkPad E480)
/0                                  bus            20KN001QPB
/0/3                                memory         16GiB System Memory
/0/3/0                              memory         8GiB SODIMM DDR4 Synchronous Unbuffered (Unregistered) 2400 MHz (0,4 ns)
/0/3/1                              memory         8GiB SODIMM DDR4 Synchronous Unbuffered (Unregistered) 2400 MHz (0,4 ns)
/0/7                                memory         256KiB L1 cache
/0/8                                memory         1MiB L2 cache
/0/9                                memory         6MiB L3 cache
/0/a                                processor      Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
/0/b                                memory         128KiB BIOS
/0/100                              bridge         Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers
/0/100/2                            display        UHD Graphics 620
/0/100/8                            generic        Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th ...

Environment:

  • Python 3.9.2
  • numpy 1.23.4

Code to reproduce experiment:

import numpy as np
from matplotlib import pyplot as plt
from timeit import default_timer as timer

def measure_time(data, op, rep=10):
    # Average wall time of op(data) over rep runs
    start = timer()
    for _ in range(rep):
        op(data)
    return (timer() - start) / rep

def prep_matrix(n, order='C'):
    return np.ones((n, n), order=order)

def mul(data):
    a, b = data
    np.multiply(a, b)

# Matching layouts (C * C) vs. mismatched layouts (C * F)
times = [measure_time((prep_matrix(i), prep_matrix(i)), mul, 3) for i in range(3100)]
times2 = [measure_time((prep_matrix(i), prep_matrix(i, order='F')), mul, 3) for i in range(3100)]
plt.plot(times)
plt.plot(times2)
plt.show()

Result (in case Stack won't let me embed the image, the plot can be found here: plot): the orange line shows the execution time in seconds for multiplication of matrices with different memory layouts, and the blue one for matching layouts. The x-axis corresponds to the matrix size.

One can clearly see that something suspicious is going on. But why?

  • I don't know enough about the subject to leave a proper answer, but this has to be due to cache misses. To cite a user who can be blindly trusted in issues of performance: [power of 2 sizes can be bad for caching](/a/9597056/5067311), potentially causing [conflict cache misses](/a/7905816/5067311) or [false aliasing](/a/8547993) or some other constellation of loss of performance. – Andras Deak -- Слава Україні Nov 13 '22 at 22:46
  • 1
    This is indeed correct. The bigger the power of two, the bigger the chance of a conflict miss. For a detailed analysis, I advise you to read [How does the CPU cache affect the performance of a C program](https://stackoverflow.com/questions/71818324/how-does-the-cpu-cache-affect-the-performance-of-a-c-program). This is for C, but Numpy is written in C. Using NT load/store is not a solution here in Python (and we want to avoid such low-level instructions in Numpy to keep the code maintainable), but using **padding** is a good simple solution (one can do that using a view on a bigger array). – Jérôme Richard Nov 14 '22 at 01:04
  • 1
    Thanks a lot. Yours answers are exactly what I looked for, but wasn't able to find on my own. – krzysztor Nov 14 '22 at 09:34
  • Finally found the animation I was looking for: https://stackoverflow.com/a/48029152/5067311 (to go with the other answer on that question). – Andras Deak -- Слава Україні Nov 14 '22 at 12:35
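The padding idea from the comments above can be sketched as follows. This is an assumption about one way to implement it, not code from the original post: allocate a slightly larger column-major buffer and operate on an n x n view of it, so the column stride is no longer a large power of two.

```python
import numpy as np
from timeit import default_timer as timer

def f_order_padded(n, pad=16):
    # Allocate an (n+pad) x (n+pad) column-major buffer and return an
    # n x n view of it; the stride between columns is then (n + pad)
    # elements, which is no longer a large power of two.
    buf = np.ones((n + pad, n + pad), order='F')
    return buf[:n, :n]

def bench(a, b, rep=10):
    start = timer()
    for _ in range(rep):
        np.multiply(a, b)
    return (timer() - start) / rep

n = 1024
c = np.ones((n, n), order='C')
f_plain = np.ones((n, n), order='F')
f_padded = f_order_padded(n)

print('plain F :', bench(c, f_plain))
print('padded F:', bench(c, f_padded))
```

The view is no longer F-contiguous, but NumPy handles strided inputs transparently; only the cache-conflict pattern changes.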
