
I ran a simple speed test of my NumPy installation:

import numpy as np

A = np.random.rand(1000, 1000)
B = np.random.rand(1000, 1000)

%timeit A.dot(B)

The result is:

30.3 ms ± 829 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This result seems abnormally slow compared with what others typically report (under 10 ms on average). What could be causing this behavior?

My system is macOS Big Sur on an M1 chip. Python is version 3.8.13 and NumPy is version 1.22.4, installed via

pip install "numpy==1.22.4"

The output of np.show_config() is:

openblas64__info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/usr/local/lib']
blas_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/usr/local/lib']
openblas64__lapack_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['/usr/local/lib']
lapack_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['/usr/local/lib']
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42
    not found = AVX,F16C,FMA3,AVX2,AVX512F,AVX512CD,AVX512_KNL,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL
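
Note that every extension in that list, from SSE through AVX-512, is an x86 instruction set, which suggests this wheel is an x86_64 build running under Rosetta 2 rather than a native arm64 build. A minimal sanity check using only the standard library:

import platform

# 'arm64' means a native Apple silicon interpreter;
# 'x86_64' means the interpreter is running under Rosetta 2.
print(platform.machine())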

Edit:

I did another test with this code snippet (from [1]):

import time
import numpy as np
np.random.seed(42)
a = np.random.uniform(size=(300, 300))
runtimes = 10

timecosts = []
for _ in range(runtimes):
    s_time = time.time()
    for i in range(100):
        a += 1
        np.linalg.svd(a)
    timecosts.append(time.time() - s_time)

print(f'mean of {runtimes} runs: {np.mean(timecosts):.5f}s')

The result of my test is:

mean of 10 runs: 6.17438s

whereas the reference results on website [1] (measured on an M1 Max chip) are:

+-----------------------------------+-----------------------+--------------------+
|   Python installed by (run on)→   | Miniforge (native M1) | Anaconda (Rosetta) |
+----------------------+------------+------------+----------+----------+---------+
| Numpy installed by ↓ | Run from → |  Terminal  |  PyCharm | Terminal | PyCharm |
+----------------------+------------+------------+----------+----------+---------+
|          Apple Tensorflow         |   4.19151  |  4.86248 |     /    |    /    |
+-----------------------------------+------------+----------+----------+---------+
|        conda install numpy        |   4.29386  |  4.98370 |  4.10029 | 4.99271 |
+-----------------------------------+------------+----------+----------+---------+

From these results, my timing is slower than every NumPy configuration in the reference.
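
A further diagnostic that may help narrow things down is the BLAS threading setup. This is a minimal sketch assuming the third-party threadpoolctl package is installed (pip install threadpoolctl); it reports which BLAS library NumPy actually loaded and how many threads it is using:

from threadpoolctl import threadpool_info  # assumption: threadpoolctl is installed

# One dict per native thread pool (OpenBLAS here), including
# the shared-library path and the number of threads in use.
for info in threadpool_info():
    print(info)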

2 Answers


I've noticed similar slowdowns on M1, but I think the actual cause, at least on my computer, is not a fundamentally faulty Numpy installation, but some problem with the benchmarks themselves. Consider the following example:

In [25]: from scipy import linalg

In [26]: a = np.random.randn(1000,100)

In [27]: %timeit a.T @ a
226 µs ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [28]: x = a.T @ a

In [29]: %timeit linalg.eigh(x)
1.69 ms ± 88.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [30]: %timeit linalg.eigh(a.T @ a)
428 ms ± 99.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Computing x = a.T @ a followed by eigh(x) takes about 2 ms in total, while eigh(a.T @ a) takes over 400 ms. I think in the latter case it's some problem with %timeit itself. Maybe for some reason the computation gets routed to the "efficiency cores"?

My tentative answer is that your first benchmark with %timeit is not reliable.
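
If you want to reproduce this comparison without %timeit, here is a minimal manual cross-check with time.perf_counter, using the same shapes as above:

import time
import numpy as np
from scipy import linalg

a = np.random.randn(1000, 100)

# Time eigh(a.T @ a) directly, outside %timeit's loop machinery.
for _ in range(5):
    start = time.perf_counter()
    linalg.eigh(a.T @ a)
    print(f'{time.perf_counter() - start:.4f} s')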


If you suspect an issue with timeit, try using time instead:

import time
import numpy as np

# The matrix multiplication from the question, timed with
# time instead of %timeit.
A = np.random.rand(1000, 1000)
B = np.random.rand(1000, 1000)

start = time.time()
A.dot(B)
took = time.time() - start
print(f"Test took {took:.4f} seconds.")

For more information on NumPy on Apple silicon, please read the first answer at the link below. For optimal performance, it is advised to build NumPy against Apple's accelerated vecLib. If you install using conda, also check out @AndrejHribernik's comment: Why Python native on M1 Max is greatly slower than Python on old Intel i5?
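
If you go the conda route, conda-forge lets you request the Accelerate-backed BLAS explicitly. This is a sketch that assumes conda-forge's BLAS metapackage selectors:

conda install -c conda-forge numpy "libblas=*=*accelerate"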