
I was multiplying two numpy arrays:

import numpy as np
X = np.random.randn(4500,3500)
v = np.random.randn(3500,200)

Both of them are C_CONTIGUOUS by default:

X.flags
# C_CONTIGUOUS : True
v.flags
# C_CONTIGUOUS : True

And the multiplication is fast:

%timeit X @ v
# 41 ms ± 2.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

However, if I reverse the X array along both axes, then something weird happens:

%timeit X[::-1,::-1] @ v
# 3.97 s ± 54.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Questions:

  1. This post says that the reversing operation creates a view. The resulting view is neither C_CONTIGUOUS nor F_CONTIGUOUS. What does that mean?

X[::-1,::-1].flags
# C_CONTIGUOUS : False
# F_CONTIGUOUS : False

  2. Why does the reversing operation slow down the multiplication so badly?
Kreol
  • Reversing dimensions like that simply makes the strides in that dimension negative, which is, by definition, not contiguous. – user3483203 Dec 31 '20 at 15:01
  • [this stackoverflow answer](https://stackoverflow.com/a/26999092/5666087) provides a great explanation of contiguous vs non-contiguous. – jkr Dec 31 '20 at 15:54

1 Answer


A c_contiguous array is an array represented as a row-major scan over a contiguous buffer. When you create a reversed view of the array, this is no longer the case, and so the view is no longer c_contiguous.
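
To make this concrete, here is a minimal sketch (variable names are illustrative) showing that the reversed view shares the original buffer but holds negative strides, which is exactly what the contiguity flags track:

import numpy as np
X = np.random.randn(4500,3500)

# C-contiguous: both strides positive, rows laid out one after another
X.strides
# (28000, 8)   -> rows are 3500 * 8 bytes apart, elements 8 bytes apart

# reversing both axes returns a view over the same buffer,
# but with negative strides in both dimensions
Y = X[::-1,::-1]
Y.base is X
# True         -> no data was copied
Y.strides
# (-28000, -8) -> negative strides are, by definition, not contiguous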

As for why the operation is slower over a reversed array: details like this generally depend on your system's BLAS/LAPACK installation. In this case, I suspect your BLAS installation has optimized code paths for the common case of matrix products over contiguous buffers, but not for products over non-contiguous buffers, which are less common.
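
If you do hit the slow path, one common workaround (a sketch, not part of the original answer) is to materialize the reversed view into a fresh contiguous buffer before the product, trading one extra copy for the fast GEMM path:

# np.ascontiguousarray copies the view into a new C-contiguous buffer
Xr = np.ascontiguousarray(X[::-1,::-1])
Xr.flags['C_CONTIGUOUS']
# True
# Xr @ v should now run at roughly the speed of X @ v on most BLAS builds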

Indeed, running this on a machine with numpy built against ubuntu's libblas gives the following:

%timeit X @ v
# 1 loop, best of 3: 200 ms per loop
%timeit X[::-1,::-1] @ v
# 1 loop, best of 3: 4.64 s per loop

while running on a machine with numpy built against MKL shows different behavior:

%timeit X @ v                                                                          
# 92.6 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit X[::-1,::-1] @ v                                                               
# 128 ms ± 2.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

(different IPython versions account for the different %timeit outputs)
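
To check which BLAS your own numpy is linked against (the output format varies across numpy versions), numpy exposes its build configuration:

import numpy as np
np.show_config()
# prints the BLAS/LAPACK libraries (e.g. openblas, mkl) numpy was built against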

jakevdp
  • Maybe it makes sense to add @user3483203's comment here as well? I think it gives really good intuition. – Kreol Jan 01 '21 at 19:25