
I just got my new MacBook Pro with the M1 Max chip and am setting up Python. I've tried several combinations of settings to test speed - now I'm quite confused. First, here are my questions:

  • Why does Python running natively on the M1 Max run much (~100%) slower than on my old MacBook Pro 2016 with an Intel i5?
  • On the M1 Max, why is there no significant speed difference between the native run (via Miniforge) and the run through Rosetta (via Anaconda), which is supposed to be ~20% slower?
  • On the M1 Max with a native run, why is there no significant speed difference between conda-installed Numpy and TensorFlow-installed Numpy, when the latter is supposed to be faster?
  • On the M1 Max, why is running in the PyCharm IDE consistently ~20% slower than running from the terminal? This doesn't happen on my old Intel Mac.

Evidence supporting my questions is as follows:


Here are the settings I've tried:

1. Python installed by

  • Miniforge-arm64, so that Python runs natively on the M1 Max chip. (Checked in Activity Monitor: the Kind of the python process is Apple; see the snippet after this list for a programmatic check.)
  • Anaconda, so that Python runs via Rosetta. (Checked in Activity Monitor: the Kind of the python process is Intel.)

2. Numpy installed by

  • conda install numpy: Numpy from the original conda-forge channel, or pre-installed with Anaconda.
  • Apple TensorFlow: with Python installed by Miniforge, I install TensorFlow directly, and Numpy is installed along with it. Reportedly, Numpy installed this way is optimized for Apple Silicon and should be faster. Here are the installation commands:
conda install -c apple tensorflow-deps
python -m pip install tensorflow-macos
python -m pip install tensorflow-metal

3. Run from

  • Terminal
  • PyCharm
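
Aside from Activity Monitor, here is a quick programmatic way to check whether a given interpreter is running natively or under Rosetta - a minimal sketch using only the standard library:

import platform

# 'arm64' means a native Apple Silicon interpreter;
# 'x86_64' means it is running under Rosetta 2 translation.
print(platform.machine())
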
Here is the test code:

import time
import numpy as np
np.random.seed(42)
a = np.random.uniform(size=(300, 300))
runtimes = 10

timecosts = []
for _ in range(runtimes):
    s_time = time.time()
    for i in range(100):
        a += 1
        np.linalg.svd(a)
    timecosts.append(time.time() - s_time)

print(f'mean of {runtimes} runs: {np.mean(timecosts):.5f}s')

and here are the results:

+-----------------------------------+-----------------------+--------------------+
|   Python installed by (run on)→   | Miniforge (native M1) | Anaconda (Rosetta) |
+----------------------+------------+------------+----------+----------+---------+
| Numpy installed by ↓ | Run from → |  Terminal  |  PyCharm | Terminal | PyCharm |
+----------------------+------------+------------+----------+----------+---------+
|          Apple TensorFlow         |   4.19151  |  4.86248 |     /    |    /    |
+-----------------------------------+------------+----------+----------+---------+
|        conda install numpy        |   4.29386  |  4.98370 |  4.10029 | 4.99271 |
+-----------------------------------+------------+----------+----------+---------+

This is quite slow. For comparison,

  • running the same code on my old MacBook Pro 2016 with the i5 chip takes 2.39917s;
  • another post (not in English) reports that on an M1 chip (not Pro or Max), miniforge + conda-installed numpy takes 2.53214s, and miniforge + Apple-TensorFlow numpy takes 1.00613s;
  • you may also try it on your own machine.

Here are the CPU details:

  • My old i5:
$ sysctl -a | grep -e brand_string -e cpu.core_count
machdep.cpu.brand_string: Intel(R) Core(TM) i5-6360U CPU @ 2.00GHz
machdep.cpu.core_count: 2
  • My new M1 Max:
% sysctl -a | grep -e brand_string -e cpu.core_count
machdep.cpu.brand_string: Apple M1 Max
machdep.cpu.core_count: 10

I followed the instructions from tutorials strictly - so why does all this happen? Is it because of flaws in my installation, or because of the M1 Max chip? Since my work relies heavily on local runs, local speed is very important to me. Any suggestions for a possible solution, or any data points from your own device, would be greatly appreciated :)

graphitump
  • I don't have an M1 yet. The easiest way to test would be to compare with the setup of people who set it up correctly. Here are links to [a set of benchmarks](https://towardsdatascience.com/m1-macbook-pro-vs-intel-i9-macbook-pro-ultimate-data-science-comparison-dde8fc32b5df) and [the installation procedure they used](https://towardsdatascience.com/how-to-easily-set-up-m1-macbooks-for-data-science-and-machine-learning-cd4f8a6b706d), including what the performance should look like in Activity Monitor. If you can replicate their results, then the M1 can't handle your code; otherwise it was the installation. – Amadan Dec 06 '21 at 03:40
  • I got an M2 Max with 96 GB (nearly the top of the line). I did the TF install for Metal, it detected the GPU, and I ran Apple's test script for Keras training - and that's fast. However, when I ran your np benchmark (in the same conda env), I was shocked to get 14.91887s!! That is slower than anything I have seen. Note that I didn't explicitly install np; I got it by installing TF. It's alarming that it can be as bad as 14s out of the box. I hope I can follow the answers below and fix this. – kawingkelvin Mar 10 '23 at 23:17

5 Answers


Update Mar 28 2022: Please see @AndrejHribernik's comment below.


How to install numpy on M1 Max, with the most accelerated performance (Apple's vecLib)? Here's the answer as of Dec 6 2021.


Steps

I. Install Miniforge

This ensures your Python runs natively on arm64, rather than being translated via Rosetta.

  1. Download Miniforge3-MacOSX-arm64.sh.
  2. Run the script, then open another shell:
$ bash Miniforge3-MacOSX-arm64.sh
  3. Create an environment (here named np_veclib):
$ conda create -n np_veclib python=3.9
$ conda activate np_veclib

II. Install Numpy with the BLAS interface specified as vecLib

  1. To compile numpy, you first need to install cython and pybind11:
$ conda install cython pybind11
  2. Compile numpy with pip (thanks to @Marijn's answer) - don't use conda install!
$ pip install --no-binary :all: --no-use-pep517 numpy
  3. An alternative to step 2 is to build from source:
$ git clone https://github.com/numpy/numpy
$ cd numpy
$ cp site.cfg.example site.cfg
$ nano site.cfg

Edit the copied site.cfg: add the following lines:

[accelerate]
libraries = Accelerate, vecLib

Then build and install:

$ NPY_LAPACK_ORDER=accelerate python setup.py build
$ python setup.py install
  4. After either step 2 or step 3, test whether numpy is using vecLib:
>>> import numpy
>>> numpy.show_config()

Then, info like /System/Library/Frameworks/vecLib.framework/Headers should be printed.
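
If you prefer a programmatic check over eyeballing the printout, here is a minimal sketch that captures the show_config() text and searches it for Accelerate/vecLib markers (the exact strings printed vary across Numpy versions, so treat the keywords as assumptions):

import io
import contextlib
import numpy as np

# Capture the text that show_config() prints to stdout.
buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    np.show_config()
config_text = buf.getvalue()

print('Accelerate/vecLib linked:',
      'vecLib' in config_text or 'accelerate' in config_text.lower())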

III. Installing other packages with conda

Make conda recognize packages installed by pip

conda config --set pip_interop_enabled true

This must be done; otherwise, if you e.g. conda install pandas, numpy will show up in the 'The following packages will be installed' list and be installed again - but the newly installed one comes from the conda-forge channel and is slow.


Comparisons to other installations:

1. Competitors:

Besides the optimal installation above, I also tried several others:

  • A. np_default: conda create -n np_default python=3.9 numpy
  • B. np_openblas: conda create -n np_openblas python=3.9 numpy blas=*=*openblas*
  • C. np_netlib: conda create -n np_netlib python=3.9 numpy blas=*=*netlib*

Options A, B, and C above are installed directly from the conda-forge channel, and numpy.show_config() shows identical results for all of them. To see the difference, examine the output of conda list - e.g. openblas packages are installed in B. Note that mkl and blis are not supported on arm64.

  • D. np_openblas_source: First install OpenBLAS by brew install openblas. Then add [openblas] path /opt/homebrew/opt/openblas to site.cfg and build Numpy from source.

For reference, I also compare against:

  • the M1 and i9-9880H results in the post linked above;
  • my old i5-6360U (2 cores) on the MacBook Pro 2016 13in.

2. Benchmarks:

Here I use two benchmarks:

  1. mysvd.py: my SVD decomposition benchmark - identical to the test code in the question above.
  2. dario.py: a benchmark script by Dario Radečić in the post linked above.

3. Results:

+-------+-----------+------------+-------------+-----------+--------------------+----+----------+----------+
|  sec  | np_veclib | np_default | np_openblas | np_netlib | np_openblas_source | M1 | i9-9880H | i5-6360U |
+-------+-----------+------------+-------------+-----------+--------------------+----+----------+----------+
| mysvd |  1.02300  |   4.29386  |   4.13854   |  4.75812  |      12.57879      |  / |     /    |  2.39917 |
+-------+-----------+------------+-------------+-----------+--------------------+----+----------+----------+
| dario |     21    |     41     |      39     |    323    |         40         | 33 |    23    |    78    |
+-------+-----------+------------+-------------+-----------+--------------------+----+----------+----------+
graphitump
  • what's the purpose of adding `--no-use-pep517` to the `pip install` command? – ogb119 Jan 09 '22 at 18:24
  • `--no-binary :all:` ignores all existing wheels and builds from scratch. Without `--no-use-pep517`, the build fails with a `could not build wheels for ...` error. – graphitump Jan 11 '22 at 00:25
  • First, thank you @graphitump for the great instructions and reproducible test case. I'd like to add that the Accelerate BLAS is now available to be specified with conda - no need to compile things manually. It is still not the default BLAS on the M1 architecture, so it needs to be specified explicitly: `conda create -n np_accelerate python=3.9 numpy "blas=*=*accelerate*"` – Andrej Hribernik Mar 28 '22 at 02:17
  • Thanks @AndrejHribernik for the update! – graphitump Mar 29 '22 at 03:34
  • The install can be a one-liner by specifying conda-forge in the create command: `conda create -n np_accelerate -c conda-forge python=3.9 numpy "blas=*=*accelerate*"` – alexbhandari May 29 '23 at 13:32
  • Note that PyTorch does not play well with an Accelerate-based BLAS without recompilation. Version 2.0.0 uses OpenBLAS, which is slow, though it can use MPS with the GPU for tensors. If you try to set the build string as above, you will get a dylib error that libopenblas.0.dylib is missing when you import torch. https://github.com/pytorch/pytorch/issues/71712#issuecomment-1020411542 – Traveler Jun 11 '23 at 19:02
  • And here is a lot more detail about using the Accelerate libraries with PyTorch. Note that the LAPACK version is no longer out of date and is at 3.9.1: https://github.com/conda-forge/pytorch-cpu-feedstock/pull/88 https://github.com/conda-forge/numpy-feedstock/issues/253 – Traveler Jun 11 '23 at 19:38

Possible Cause: Different BLAS Libraries

Since the benchmark runs linear algebra routines, what is likely being tested here are the BLAS implementations. A default Anaconda distribution for the osx-64 platform comes with Intel's MKL implementation; the osx-arm64 platform offered only the generic Netlib BLAS and OpenBLAS options when this question was first asked.
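
One way to confirm at runtime which BLAS a given Numpy build actually loads is the third-party threadpoolctl package (pip install threadpoolctl). A minimal sketch - note that Accelerate exposes no thread-pool interface, so it is typically identified by the absence of OpenBLAS/MKL/BLIS entries:

# Requires: pip install threadpoolctl
import numpy as np  # import first so Numpy's BLAS library gets loaded
from threadpoolctl import threadpool_info

# Each entry reports the internal API ('openblas', 'mkl', 'blis', ...),
# the shared-library path, the version, and the thread count.
for lib in threadpool_info():
    print(lib['internal_api'], lib.get('version'), lib['filepath'])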

I get the following benchmark results (updated May 2023):

macOS Intel Core i9 (x86_64)

+---------------------+-----------------+
| BLAS Implementation | Mean Timing (s) |
+---------------------+-----------------+
| mkl                 |     0.95932     |
| blis                |     1.72059     |
| openblas            |     2.17023     |
| accelerate          |     2.56365     |
| netlib              |     5.72782     |
+---------------------+-----------------+

macOS M1 4P/4E (arm64)

+-------------------------+-----------------+
| BLAS Implementation     | Mean Timing (s) |
+-------------------------+-----------------+
| accelerate (macOS 13.3) |     0.98718     |
| accelerate (macOS 13.2) |     1.03141     |
| netlib                  |     4.36523     |
| openblas                |    10.33956     |
+-------------------------+-----------------+

So I suspect the old MBP had MKL installed, while the M1 system in the OP is using either Netlib or OpenBLAS. Apple Silicon users should identify which library runs fastest on their system; the consensus appears to be that for M1 and M2 systems, Apple's Accelerate library is the most performant.

Note that the Accelerate implementation was significantly updated in macOS 13.3, which seems to provide a slight performance boost (~5%).

Hence, M1/M2 users should consider including the requirement:

'blas=*=accelerate'

when creating Conda environments.
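
For a quick spot-check of SVD throughput in whatever environment is currently active, here is a sketch mirroring one inner loop of the question's benchmark (not the exact so-np-bench.py script used below):

import timeit
import numpy as np

a = np.random.uniform(size=(300, 300))
# Time 100 SVDs, comparable to one inner loop of the question's benchmark.
print(f'{timeit.timeit(lambda: np.linalg.svd(a), number=100):.5f}s')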


Specifying BLAS Implementation

Specifically, here are the different environments I tested:

## Note that `conda-forge` channel is prioritized on my system
## also, `mamba` is a faster version of `conda`

# MKL
mamba create -n np_mkl python=3.9 numpy 'blas=*=mkl'

# BLIS
mamba create -n np_blis python=3.9 numpy 'blas=*=blis'

# OpenBLAS
mamba create -n np_openblas python=3.9 numpy 'blas=*=openblas'

# Accelerate
mamba create -n np_accelerate python=3.9 numpy 'blas=*=accelerate'

# Netlib
mamba create -n np_netlib python=3.9 numpy 'blas=*=netlib'

and ran the benchmark script (so-np-bench.py) with

conda run -n np_mkl python so-np-bench.py

# etc.
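
To automate the comparison, here is a small driver sketch that runs the same script in each environment; it assumes the environment names above exist and that so-np-bench.py is the SVD benchmark:

import subprocess

# Run the benchmark inside each Conda environment via `conda run`.
for env in ['np_mkl', 'np_blis', 'np_openblas', 'np_accelerate', 'np_netlib']:
    print(f'--- {env} ---')
    subprocess.run(['conda', 'run', '-n', env, 'python', 'so-np-bench.py'],
                   check=True)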

Possible caveats with using Accelerate

Note that there has been an incompatibility between SciPy and Accelerate, due to the LAPACK implementation in Accelerate being old. However, Accelerate was updated in macOS 13.3, which may resolve such issues. See this thread for details.


Emulation Mode (Rosetta)

Sometimes one may have to create an environment in emulation mode, e.g., when a needed package doesn't have an osx-arm64 build yet. Here is the benchmarking (as of July 2023) using forced osx-64 on an M1:

macOS M1 4P/4E (arm64 with osx-64 subdir)

+---------------------+----------------------+
| BLAS Implementation | Mean Timing (s)      |
+---------------------+----------------------+
| accelerate          | 1.71188              |
| blis                | 2.41957              |
| netlib              | 3.92932              |
| openblas            | 5.98487              |
| mkl                 | n/a (cannot emulate) |
+---------------------+----------------------+

These runs use the same creation commands as above, but with each command prefixed by CONDA_SUBDIR=osx-64 (e.g., CONDA_SUBDIR=osx-64 mamba create -n np_accelerate python=3.9 numpy 'blas=*=accelerate').

Important Note: the openblas runs use all cores, but to no apparent benefit, so I would absolutely recommend avoiding it for now. There might be a bug in how it uses OpenMP.

Takeaway: if you are on an Apple Silicon machine, use 'blas=*=accelerate' whether or not you are emulating. However, the native environment seems to be the fastest option.

merv
  • Thanks @merv. I guess this is the right approach. I created 3 environments: `np_default`, `np_openblas` and `np_netlib`, but each produces very similar results. Checking which BLAS interface is used via `numpy.show_config()` gives exactly the same output for each - only `libraries = ['cblas', 'blas', 'cblas', 'blas']` in `blas_info`, no `openblas` or `netlib` - which suggests the three actually installed the same numpy. Could you please explain why? – graphitump Dec 06 '21 at 21:56
  • I asked a friend who is using an M1 (not the Pro or Max). He used exactly the same procedure to install Python as me (miniforge, then `conda install numpy`), but he got `openblas` in his numpy, while I don't. – graphitump Dec 06 '21 at 22:08
  • Do I need to install openblas and netlib myself before using conda to install the respective numpy builds? – graphitump Dec 06 '21 at 22:34
  • @graphitump the BLAS libraries will all show up identically to `numpy.show_config()` because they (`libblas`, `libcblas`, `liblapack`, etc.) go by the same name and have the same API, but link to different libraries (`openblas`, `mkl`, etc.). You have to examine the `conda list` package *builds*, which will have strings like `openblas`, `netlib`, etc. – merv Dec 06 '21 at 22:39
  • @graphitump the libraries should be installed through Conda, as indicated in the answer. – merv Dec 06 '21 at 22:45
  • Thank you @merv for pointing this out. Yes, `conda list` shows the difference. But it turns out the conda-forge channel cannot install the fastest one - even `openblas` is slow at ~4.2s. A solution is to install with `pip` or build from source, using the `vecLib` framework. Details are posted in my answer to this question. Thanks so much :) – graphitump Dec 07 '21 at 04:57

With Miniforge3-MacOSX-arm64 and conda install -c conda-forge numpy "libblas=*=*accelerate", it works perfectly on my MacBook M1 Max.

  • M1 Max with libblas accelerate: 1.024
  • M1 Max without libblas accelerate: 2.672
Ubique

Thank you for the tips. I followed these steps on my brand-new Mac M1 Max:

  1. Installed Miniforge3 (bash Miniforge3-MacOSX-arm64.sh)
  2. Initialized a conda base environment (conda init) with Python 3.10
  3. Installed numpy with: conda install numpy "libblas=*=*accelerate"

I then ran the suggested benchmarks from the link:

  1. The mysvd.py script mentioned above runs with a mean over 10 runs of 1.08088s.

  2. The dario.py script from https://gist.githubusercontent.com/daradecic/a2ac0a75d7e5f22c9aa07174dcbbe061/raw/a56ee217e6d3f949b1d1f719a7a134cef130cd9f/macs.py gives:

Dotted two 4096x4096 matrices in 0.28 s.
Dotted two vectors of length 524288 in 0.11 ms.
SVD of a 2048x1024 matrix in 0.44 s.
Cholesky decomposition of a 2048x2048 matrix in 0.07 s.
Eigendecomposition of a 2048x2048 matrix in 3.83 s.

TOTAL TIME = 19 seconds
ABarrier

The fastest way I found to do this is:

conda create -n np_accelerate -c conda-forge python=3.9 numpy 'blas=*=accelerate'

If you want to add conda-forge for future use, you can append it to your channel config:

conda config --append channels conda-forge

conda create -n np_accelerate python=3.9 numpy 'blas=*=accelerate'

Other answers mention a few variants of this: blas=*=*accelerate* and libblas=*=*accelerate. These all work with the above command as well and result in the same performance in my testing. (The pattern reads: any package version, with a build string matching the glob - so *accelerate* matches any build whose string contains 'accelerate'.)

alexbhandari
  • Do you find that it works with Python 3.10? – Traveler Jun 10 '23 at 20:08
  • Yes, it does work with Python 3.10. But it cannot be combined with PyTorch, which requires OpenBLAS unless recompiled, as mentioned separately. – Traveler Jun 11 '23 at 19:04
  • Thanks for checking! It should work with all newer Python versions, but I did not check that myself. PyTorch also supports the M1/M2 GPU if compiled properly: https://pytorch.org/blog/introducing-accelerated-pytorch-training-on-mac/ – alexbhandari Jun 11 '23 at 21:44
  • Here is more detail on the incompatibility of pytorch and accelerate via conda: https://github.com/conda-forge/pytorch-cpu-feedstock/pull/88#issuecomment-1657284460 – Traveler Jul 31 '23 at 02:36