
I always thought Python list comprehensions don't implicitly use multiprocessing, and reading questions on Stack Overflow (e.g. this one) gave me the same impression. However, here is my little experiment:

import numpy as np
from sklearn import metrics  # provides mean_squared_error used below
import time

# some arbitrary data
n = 1000
p = 5
X = np.block([[np.eye(p)], [np.zeros((n-p, p))]])
y = np.sum(X, axis=1) + np.random.normal(0, 1, (n, ))
n_loop = 100000

# run linear regression using direct matrix algebra
def in_sample_error_algebra(X, y):
    beta_hat = np.linalg.inv(X.transpose()@X)@(X.transpose()@y)
    y_hat = X@beta_hat
    error = metrics.mean_squared_error(y, y_hat)
    return error


start = time.time()
errors = [in_sample_error_algebra(X, y) for _ in range(n_loop)]
print('run time =', round(time.time() - start, 2), 'seconds')

run time = 19.68 seconds

While this code was running, all 6 (physical) cores of my CPU shot up to 100% usage.

[screenshot: all CPU cores at 100% usage]

What's even more magical is that when I changed from the list comprehension to a for loop, the same thing happened. I thought that with the .append, it had to run sequentially. See below:

start = time.time()
errors = []
for _ in range(n_loop):
    errors.append(in_sample_error_algebra(X, y))
print('run time =', round(time.time() - start, 2), 'seconds')

run time = 21.29 seconds

[screenshot: all CPU cores again at 100% during the for loop]

Any theories?

Python 3.7.2, numpy 1.15.4

Indominus

1 Answer


Indeed, pure Python computations do not benefit from multithreading. The global interpreter lock (GIL) prevents multiple threads from accessing the interpreter at the same time.
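You can see the GIL in action with a quick sketch (hypothetical timing code, not from the question): a CPU-bound pure-Python function gets no speedup from two threads.

import threading
import time

def count_down(n):
    # pure-Python, CPU-bound loop; holds the GIL while it runs
    while n > 0:
        n -= 1

N = 10_000_000

start = time.time()
count_down(N)
count_down(N)
print('sequential:', round(time.time() - start, 2), 'seconds')

start = time.time()
threads = [threading.Thread(target=count_down, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print('two threads:', round(time.time() - start, 2), 'seconds')

On CPython both timings come out roughly the same (the threaded version is often even slower), because only one thread can execute Python bytecode at a time.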

However, multiprocessing is possible in Python because each process runs its own instance of the Python interpreter. This has a performance cost: initialization and data sharing between processes are not free. Often it's not even worth the effort.
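As a sketch (not the question's code, and the worker count is just the Pool default), the loop could be parallelized explicitly like this. Note that X and y have to be pickled and shipped to every worker, which is exactly the kind of overhead meant above:

from multiprocessing import Pool
import numpy as np

def mse_via_algebra(args):
    # same computation as in_sample_error_algebra, made picklable
    X, y = args
    beta_hat = np.linalg.inv(X.T @ X) @ (X.T @ y)
    return float(np.mean((y - X @ beta_hat) ** 2))

if __name__ == '__main__':
    n, p = 1000, 5
    X = np.block([[np.eye(p)], [np.zeros((n - p, p))]])
    y = np.sum(X, axis=1) + np.random.normal(0, 1, (n,))
    with Pool() as pool:  # defaults to one worker per CPU core
        errors = pool.map(mse_via_algebra, [(X, y)] * 1000)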

The story is different for numpy. Numpy largely consists of native functions written in C. When the C code does not need the interpreter for a while, it can release the GIL and allow a different Python thread to run at the same time. The C code can also spawn "non-Python" threads to parallelize computations. This is what happens in numpy.
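As an aside, this GIL release also means plain Python threads can overlap numpy calls. A minimal sketch (array sizes are arbitrary):

from concurrent.futures import ThreadPoolExecutor
import numpy as np

def invert(_):
    a = np.random.rand(800, 800)
    # native LAPACK call; the GIL is released for most of its duration
    return np.linalg.inv(a)

with ThreadPoolExecutor(max_workers=4) as ex:
    results = list(ex.map(invert, range(8)))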

Actually, numpy itself does not spawn threads (I think), but many of the matrix/vector and linear algebra routines call into the low-level libraries BLAS and LAPACK. There exist various implementations of these libraries, and some are optimized for multithreading. Your version of numpy apparently uses one of those.
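You can check which BLAS/LAPACK implementation your numpy build is linked against:

import numpy as np
np.__config__.show()  # prints the BLAS/LAPACK libraries numpy was built with

If it reports MKL or OpenBLAS, multithreaded linear algebra is the likely explanation.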

In conclusion, neither the outer list comprehension nor the for loop run in parallel, but np.linalg.inv and also the matrix product X @ beta_hat may internally run multiple threads. See Parallel Programming with numpy and scipy for more information.
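If you ever want the single-threaded behavior back (for benchmarking, or because you parallelize at a higher level yourself), a common trick is to cap the BLAS thread pool with environment variables. Which variable applies depends on the backend, and they must be set before numpy is imported:

import os
# set before numpy is imported; pick the one matching your backend
os.environ['OMP_NUM_THREADS'] = '1'       # OpenMP-based builds
os.environ['MKL_NUM_THREADS'] = '1'       # Intel MKL
os.environ['OPENBLAS_NUM_THREADS'] = '1'  # OpenBLAS

import numpy as np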

MB-F
  • Thanks for the great explanations! It seems that whether numpy utilizes more than one core of my machine is not determined at "architect time"; it depends on the version of numpy and which functions are used (say, inv or @ vs. simple + and -). That's a problem if I want to design my code in a generic way: if I provide an interface that allows certain virtual functions to be parallelized, I don't know whether someone else would make an implementation calling already-parallelized functions or not. If yes, then the performance would be disastrous (I tested). Any usual good practice in this case? – Indominus Jan 29 '19 at 23:27
  • Even worse, it does not only depend on the numpy version; the same version can be linked against different computation backends. If you provide functions, make parallelization an optional feature and warn the user to be sensible ;) You can try to disable multithreading in numpy (get started here: [Python: How do you stop numpy from multithreading?](https://stackoverflow.com/q/17053671/3005167)), but I don't know how portable or robust that is. – MB-F Jan 30 '19 at 07:20
  • If, instead of my local computer, my optional parallelization feature were only exposed to users through a remote cluster consisting of multiple machines, would you say this issue is mitigated? I'm thinking of AWS: once I throw my n jobs at it, it wouldn't fall into the same trap as when I do the same on my local single machine, would it? – Indominus Jan 30 '19 at 08:46
  • My experience with clustering is very limited. If there is only ever one active job per machine it should be fine. I've seen in the past that numpy used at most 8 or 16 cores or so, but that was several years ago and I'm not sure if it is actually useful information in your case... – MB-F Jan 30 '19 at 10:11
  • Thank you so much and this should probably be a separate question. Maybe after I try it out on AWS first. – Indominus Jan 30 '19 at 10:45