
I've set up two identical tests in MATLAB and Python comparing matrix multiplication with broadcasting. For Python I used NumPy; for MATLAB I used the `mtimesx` library, which uses BLAS.

MATLAB

close all; clear;

N = 1000 + 100; % a few initial runs to be trimmed off at the end

a = 100;
b = 30;
c = 40;
d = 50;
A = rand(b, c, a);
B = rand(c, d, a);
C = zeros(b, d, a);

times = zeros(1, N);
for ii = 1:N
    tic
    C = mtimesx(A,B);
    times(ii) = toc;
end

times = times(101:end) * 1e3;

plot(times);
grid on;
title(median(times));

Python

import timeit
import numpy as np
import matplotlib.pyplot as plt


N = 1000 + 100  # a few initial runs to be trimmed off at the end

a = 100
b = 30
c = 40
d = 50
A = np.arange(a * b * c).reshape([a, b, c])
B = np.arange(a * c * d).reshape([a, c, d])
C = np.empty(a * b * d).reshape([a, b, d])

times = np.empty(N)

for i in range(N):
    start = timeit.default_timer()
    C = A @ B
    times[i] = timeit.default_timer() - start

times = times[100:] * 1e3  # trim the first 100 warm-up runs

plt.plot(times, linewidth=0.5)
plt.grid()
plt.title(np.median(times))
plt.show()
  • My NumPy is the default build installed from pip, which uses OpenBLAS.
  • I'm running on an Intel NUC with an i3 CPU.

The MATLAB code runs in about 1 ms while the Python code takes about 5.8 ms, and I can't figure out why, as it seems both are using BLAS.


EDIT

From Anaconda:

In [7]: np.__config__.show()
mkl_info:
    libraries = ['mkl_rt']
    library_dirs = [...]
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = [...]
blas_mkl_info:
    libraries = ['mkl_rt']
    library_dirs = [...]
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = [...]
blas_opt_info:
    libraries = ['mkl_rt']
    library_dirs = [...]
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = [...]
lapack_mkl_info:
    libraries = ['mkl_rt']
    library_dirs = [...]
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = [...]
lapack_opt_info:
    libraries = ['mkl_rt']
    library_dirs = [...]
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = [...]

From NumPy installed with pip:

In [2]: np.__config__.show()
blas_mkl_info:
NOT AVAILABLE
blis_info:
NOT AVAILABLE
openblas_info:
    library_dirs = [...]
    libraries = ['openblas']
    language = f77
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    library_dirs = [...]
    libraries = ['openblas']
    language = f77
    define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
NOT AVAILABLE
openblas_lapack_info:
    library_dirs = [...]
    libraries = ['openblas']
    language = f77
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    library_dirs = [...]
    libraries = ['openblas']
    language = f77
    define_macros = [('HAVE_CBLAS', None)]

EDIT 2: I tried replacing `C = A @ B` with `np.matmul(A, B, out=C)` and got a 2x worse time, around 11 ms. This is really strange.

galah92
  • See [related question](https://stackoverflow.com/questions/6058139/why-is-matlab-so-fast-in-matrix-multiplication) – etmuse Oct 17 '18 at 14:54
  • @etmuse Thanks, already saw that. My argument is that both MATLAB (or `mtimesx`) and NumPy are using BLAS, so I don't see why there should be any difference. – galah92 Oct 17 '18 at 14:54
  • I get that your question is different, just thought the information in the answers on that question would be useful for both future visitors and potential answerers to this one :) – etmuse Oct 17 '18 at 15:02
  • 7
    @galah92 _Which_ BLAS, though. If you see the most voted answer in that post, it mentions that Matlab uses [Intel MKL](https://software.intel.com/mkl), which is very fast (on Intel hardware, at least). You can check what you NumPy distribution is using with `np.show_config()`; in my case it is [OpenBLAS](https://www.openblas.net/). The different between these two is [significant](https://software.intel.com/en-us/articles/performance-comparison-of-openblas-and-intel-math-kernel-library-in-r). – jdehesa Oct 17 '18 at 15:06
  • 2
    Btw, this is why there is [`intel-numpy`](https://pypi.org/project/intel-numpy/), part of the [Intel Distribution for Python](https://software.intel.com/distribution-for-python). – jdehesa Oct 17 '18 at 15:07
  • @jdehesa Thanks, I downloaded and benchmarked on Anaconda, which uses MKL as well, and got the same results as the default Python environment. – galah92 Oct 17 '18 at 15:33
  • @galah92 Not saying that you are wrong, but just in case, have you checked with `np.show_config()`? Because I have Anaconda, but in some environments I have OpenBLAS NumPy (for a variety of reasons). – jdehesa Oct 17 '18 at 15:37
  • 1
    To reiterate the above: Please show the output of `np.show_config()` – Eric Oct 17 '18 at 16:12
  • @Eric added to the question. – galah92 Oct 18 '18 at 05:14
  • As a sanity check - can you verify that the result `C` is the same shape / value in both cases? Your initialization is being ignored, as you're just overwriting the variable. – Eric Oct 18 '18 at 05:37
  • I think your problem is the memory ordering of your arrays not matching. – Eric Oct 18 '18 at 05:40

3 Answers

Your MATLAB code uses floating point arrays, but the NumPy code uses integer arrays. This makes a significant difference in the timing. For an "apples to apples" comparison between MATLAB and NumPy, the Python/NumPy code must also use floating point arrays.
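
For example, an apples-to-apples setup creates both arrays as floats with `np.random.rand`, mirroring the MATLAB `rand` call (a minimal sketch; shapes and stack-axis order taken from the question's Python code):

import numpy as np

a, b, c, d = 100, 30, 40, 50

# float64 arrays, like MATLAB's rand(b, c, a) but with the stack axis first
A = np.random.rand(a, b, c)
B = np.random.rand(a, c, d)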

That is not the only significant issue, however. There is a deficiency in NumPy, discussed in issue 7569 (and again in issue 8957) on the NumPy GitHub site: matrix multiplication of "stacked" arrays does not use the fast BLAS routines to perform the multiplications. This means that the multiplication of arrays with more than two dimensions can be much slower than expected.

Multiplication of 2-d arrays does use the fast routines, so you can work around this issue by multiplying the individual 2-d arrays in a loop. Surprisingly, despite the overhead of a Python loop, in many cases this is faster than `@`, `matmul` or `einsum` applied to the full stacked arrays.

Here's a variation of a function shown in the NumPy issue that does the matrix multiplications in a Python loop:

def xmul(A, B):
    """
    Multiply stacked matrices A (with shape (s, m, n)) by stacked
    matrices B (with shape (s, n, p)) to produce an array with
    shape (s, m, p).

    Mathematically equivalent to A @ B, but faster in many cases.

    The arguments are not validated.  The code assumes that A and B
    are numpy arrays with the same data type and with shapes described
    above.
    """
    out = np.empty((A.shape[0], A.shape[1], B.shape[2]), dtype=A.dtype)
    for j in range(A.shape[0]):
        # each 2-d product here goes through the fast BLAS path
        np.matmul(A[j], B[j], out=out[j])
    return out

My NumPy installation also uses MKL (it is part of the Anaconda distribution). Here is a timing comparison of `A @ B` and `xmul(A, B)`, using arrays of floating point values:

In [204]: A = np.random.rand(100, 30, 40)

In [205]: B = np.random.rand(100, 40, 50)

In [206]: %timeit A @ B
4.76 ms ± 6.37 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [207]: %timeit xmul(A, B)
582 µs ± 35.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Even though `xmul` uses a Python loop, it takes about 1/8 the time of `A @ B`.

Warren Weckesser

I think this is a problem of memory ordering. MATLAB's `zeros(a, b, c)` is like NumPy's `np.zeros((a, b, c), order='F')`, which is not the default.

Of course, as you've already identified, `@` operates on different axes from `mtimesx`. To make the comparison fair, you should ensure your arrays are in the MATLAB order, then transpose to deal with the semantic difference:

# note: `order` in reshape actually changes the resulting array data,
# not just its memory layout
A = np.arange(a * b * c).reshape([b, c, a], order='F').transpose((2, 0, 1))
B = np.arange(a * c * d).reshape([c, d, a], order='F').transpose((2, 0, 1))
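
A quick way to sanity-check the layout (assuming the `A` and `B` from the snippet above): after the reshape and transpose, each 2-d page should be Fortran-contiguous, matching MATLAB's column-major pages:

# each stacked page is now column-major, like a MATLAB page
print(A[0].flags['F_CONTIGUOUS'])  # expected: True
print(B[0].flags['F_CONTIGUOUS'])  # expected: True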
Eric
  • I thought about it. MATLAB defaults to Fortran order and NumPy to C order, and in my practical use case (realtime processing) I do have control over my data layout, so I thought I'd just use the default order in each one. Nevertheless, I tried your code and got around 5.8 ms - worse than the C order... – galah92 Oct 18 '18 at 06:06

Could you try again with the recently released NumPy 1.16? We refactored `matmul` to use BLAS for the inner two dimensions, which should speed up the code.
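
A quick way to confirm which NumPy version the benchmark is actually running against:

import numpy as np

print(np.__version__)  # the matmul refactor shipped in NumPy 1.16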

mattip