```python
import numpy as np
a = np.random.random((500, 500, 500))
b = np.random.random((500, 500))

%timeit a[250, :, :] = b
%timeit a[:, 250, :] = b
%timeit a[:, :, 250] = b
```

```
107 µs ± 2.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
52 µs ± 88.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1.59 ms ± 4.45 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

Observations:

  1. Performance differs across the three runs: modifying the 1st and 3rd dimensions (slicing along the 2nd) is the fastest of the three, while modifying the 1st and 2nd dimensions (slicing along the 3rd) is the slowest.
  2. Speed does not appear to be monotonic with respect to the index of the dimension being sliced.

My questions are:

  1. What mechanism in NumPy explains these observations?
  2. With the answer to the 1st question in mind, how can I speed up my code by arranging the dimensions properly, given that some dimensions are modified in bulk while the rest are only sliced?
JShi
  • After running the initializing snippet once, I ran your `timeit` snippet several times, then picked the best time for line1, best time for line2, and best time for line3, of your snippet. I find the best times to be in ascending order for line1 : line2 : line3 as `6` : `11` : `88` – fountainhead Dec 26 '20 at 07:10
  • That the last is slowest makes sense, since none of the elements in `a[:,:,250]` are contiguous. Why the middle is about 2x faster than the first is harder to reason about, and may not be worth worrying about. I found it useful to look at the `strides` in the respective calculations. – hpaulj Dec 26 '20 at 07:29
  • Take a look at [Locality of reference](https://en.wikipedia.org/wiki/Locality_of_reference); it explains why contiguous memory access is faster. – cs95 Dec 26 '20 at 09:50
  • @fountainhead This might be CPU-related. The result I posted above comes from an AMD R5-3600 PC (results are similar regardless of whether I'm using MKL, OpenBLAS, or AMD's own BLIS), while on an Intel i7-6700 PC (with MKL) I got `12`:`13`:`396`. – JShi Dec 26 '20 at 11:15
  • @hpaulj: indeed, and on the several machine/archs I tried, there is no meaningful timing difference between the middle case and the first. – Pierre D Dec 27 '20 at 15:05

1 Answer


As several comments have indicated, it's all about locality of reference. Think about what NumPy has to do at the low level, and how far apart in memory the consecutive lvalues are in the 3rd case.
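A quick way to see this, as one of the comments suggests, is to inspect the `strides` (byte steps along each axis) of the three slice views; the stride values below assume the same C-contiguous `float64` array of shape `(500, 500, 500)` as in the question:

```python
import numpy as np

a = np.zeros((500, 500, 500))  # C-contiguous float64; itemsize is 8 bytes

# Strides of each slice view: how many bytes separate consecutive elements.
print(a[250, :, :].strides)  # (4000, 8): rows are fully contiguous
print(a[:, 250, :].strides)  # (2000000, 8): contiguous runs of 500 elements
print(a[:, :, 250].strides)  # (2000000, 4000): no two elements are adjacent
```

In the last case every written element lands 4000 bytes from the previous one, so almost every write touches a new cache line.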

Note also how the timings change when the arrays are not C-contiguous, but F-contiguous instead:

```python
a = np.asfortranarray(a)
b = np.asfortranarray(b)

%timeit a[250, :, :] = b
%timeit a[:, 250, :] = b
%timeit a[:, :, 250] = b
```

```
892 µs ± 22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
169 µs ± 66.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
158 µs ± 24.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
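With the layout in mind, one practical answer to the 2nd question is to arrange the array so that the axis being sliced in a hot loop comes first in a C-contiguous array. A minimal sketch (the array size is shrunk here to keep memory small; the idea is the same at `500**3`):

```python
import numpy as np

n = 100
a = np.random.random((n, n, n))    # C-contiguous
b = np.random.random((n, n))

# a[:, :, k] = b writes one element every n*8 bytes at best.  Moving the
# sliced axis to the front, then making the result C-contiguous again,
# turns the same logical update into one contiguous block write:
a2 = np.ascontiguousarray(np.moveaxis(a, 2, 0))
a2[50, :, :] = b                   # equivalent to a[:, :, 50] = b

assert a2.flags["C_CONTIGUOUS"]
assert np.array_equal(a2[50], b)
```

Note that `np.moveaxis` only returns a strided view; it is the `np.ascontiguousarray` copy that actually relays the data, so this pays off when the array is rearranged once and sliced many times.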

(A very small side note: for the same reason, it is sometimes advantageous to sort a DataFrame before doing a groupby and a bunch of repetitive operations on the groups, somewhat counter-intuitively since the sort itself takes O(n log n).)
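As an illustrative sketch of that side note (hypothetical data; whether the sort pays off depends on group sizes and hardware):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "key": rng.integers(0, 1000, 1_000_000),
    "val": rng.random(1_000_000),
})

# Sorting by the group key makes each group's rows contiguous in memory,
# which can make repeated groupby passes over the frame cache-friendlier.
df_sorted = df.sort_values("key", kind="stable", ignore_index=True)

# The aggregation result is identical either way; only locality differs.
s_unsorted = df.groupby("key")["val"].sum()
s_sorted = df_sorted.groupby("key")["val"].sum()
assert np.allclose(s_unsorted, s_sorted)
```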

Pierre D