
I ran a simple speed test of my NumPy installation:

import numpy as np

A = np.random.rand(1000, 1000)
B = np.random.rand(1000, 1000)

%timeit A.dot(B)

The result is:

30.3 ms ± 829 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This result seems abnormally slow compared with what others typically report (under 10 ms on average). What could be causing this behavior?

My system is macOS Big Sur on an M1 chip. Python is version 3.8.13 and NumPy is version 1.22.4, installed via

pip install "numpy==1.22.4"

The output of np.show_config() is:

openblas64__info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/usr/local/lib']
blas_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/usr/local/lib']
openblas64__lapack_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['/usr/local/lib']
lapack_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['/usr/local/lib']
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42
    not found = AVX,F16C,FMA3,AVX2,AVX512F,AVX512CD,AVX512_KNL,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL
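
Note that every extension in that list, from SSE through AVX-512, is an x86 instruction set, which suggests this wheel is an x86_64 build running under Rosetta 2 rather than a native arm64 build. A minimal sanity check using only the standard library:

import platform

# 'arm64' means a native Apple silicon interpreter;
# 'x86_64' means the interpreter is running under Rosetta 2.
print(platform.machine())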

Edit:

I did another test with this code snippet (from [1]):

import time
import numpy as np
np.random.seed(42)
a = np.random.uniform(size=(300, 300))
runtimes = 10

timecosts = []
for _ in range(runtimes):
    s_time = time.time()
    for i in range(100):
        a += 1
        np.linalg.svd(a)
    timecosts.append(time.time() - s_time)

print(f'mean of {runtimes} runs: {np.mean(timecosts):.5f}s')

The result of my test is:

mean of 10 runs: 6.17438s

whereas the reference results on website [1] (measured on an M1 Max chip) are:

+-----------------------------------+-----------------------+--------------------+
|   Python installed by (run on)→   | Miniforge (native M1) | Anaconda (Rosetta) |
+----------------------+------------+------------+----------+----------+---------+
| Numpy installed by ↓ | Run from → |  Terminal  |  PyCharm | Terminal | PyCharm |
+----------------------+------------+------------+----------+----------+---------+
|          Apple Tensorflow         |   4.19151  |  4.86248 |     /    |    /    |
+-----------------------------------+------------+----------+----------+---------+
|        conda install numpy        |   4.29386  |  4.98370 |  4.10029 | 4.99271 |
+-----------------------------------+------------+----------+----------+---------+

From these results, my timing is slower than every NumPy configuration in the reference.
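
A further diagnostic that may help narrow things down is the BLAS threading setup. This is a minimal sketch assuming the third-party threadpoolctl package is installed (pip install threadpoolctl); it reports which BLAS library NumPy actually loaded and how many threads it is using:

from threadpoolctl import threadpool_info  # assumption: threadpoolctl is installed

# One dict per native thread pool (OpenBLAS here), including
# the shared-library path and the number of threads in use.
for info in threadpool_info():
    print(info)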

2 Answers


I've noticed similar slowdowns on M1, but I think the actual cause, at least on my computer, is not a fundamentally faulty Numpy installation, but some problem with the benchmarks themselves. Consider the following example:

In [25]: from scipy import linalg

In [26]: a = np.random.randn(1000,100)

In [27]: %timeit a.T @ a
226 µs ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [28]: x = a.T @ a

In [29]: %timeit linalg.eigh(x)
1.69 ms ± 88.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [30]: %timeit linalg.eigh(a.T @ a)
428 ms ± 99.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Computing x = a.T @ a followed by eigh(x) takes about 2 ms in total, while eigh(a.T @ a) takes over 400 ms. I think in the latter case it's some problem with %timeit itself. Maybe for some reason the computation gets routed to the "efficiency cores"?

My tentative answer is that your first benchmark with %timeit is not reliable.
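
If you want to reproduce this comparison without %timeit, here is a minimal manual cross-check with time.perf_counter, using the same shapes as above:

import time
import numpy as np
from scipy import linalg

a = np.random.randn(1000, 100)

# Time eigh(a.T @ a) directly, outside %timeit's loop machinery.
for _ in range(5):
    start = time.perf_counter()
    linalg.eigh(a.T @ a)
    print(f'{time.perf_counter() - start:.4f} s')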


If you suspect an issue with timeit, try using time instead:

import time
import numpy as np

# The matrix multiplication from the question, timed with
# time instead of %timeit.
A = np.random.rand(1000, 1000)
B = np.random.rand(1000, 1000)

start = time.time()
A.dot(B)
took = time.time() - start
print(f"Test took {took:.4f} seconds.")

For more information on NumPy on Apple silicon, please read the first answer at the link below. For optimal performance, it is advised to build NumPy against Apple's accelerated vecLib. If you install using conda, also check out @AndrejHribernik's comment: Why Python native on M1 Max is greatly slower than Python on old Intel i5?
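
If you go the conda route, conda-forge lets you request the Accelerate-backed BLAS explicitly. This is a sketch that assumes conda-forge's BLAS metapackage selectors:

conda install -c conda-forge numpy "libblas=*=*accelerate"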