Thanks for the reply @TomFenech, indeed I should have included the CPU usage information:
- Local (4 vCPUs): single call ~390%, double call ~190-200% each
- Google cluster (8 vCPUs): single call ~400%, double call ~400% each (as expected)
Conclusion of the toy example: You are right. When I run htop, I actually see 4 processes per started job, not 1. So each job is internally distributing itself across cores. I think this is because BLAS/MKL parallelizes the (matrix) multiplication over threads.
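A quick way to verify this is to cap the BLAS thread pool and watch the timing change; a minimal sketch, assuming OpenBLAS/MKL honor the standard threading environment variables (which must be set before Numpy is imported):
import os
os.environ["OMP_NUM_THREADS"] = "1"   # honored by OpenBLAS
os.environ["MKL_NUM_THREADS"] = "1"   # honored by MKL

import time
import numpy as np

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)
now = time.time()
a @ b
print('Capped matmul time', round(time.time() - now, 1))
# Without the env vars above, the same multiplication fans out over all
# visible cores, which is why htop shows several threads per job.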
Continuation for the true job: The toy example above was, in hindsight, a special case and not a perfect stand-in for my true script. My true (machine learning) script only partially relies on Numpy (not for matrix multiplication); most of the heavy computation is performed in PyTorch. When I run my script locally (4 vCPUs), it uses ~220% CPU. When I run the same script on the Google Cloud cluster (8 vCPUs), it surprisingly reaches up to ~700% (htop indeed shows 7-8 processes). So PyTorch seems to be doing an even better job of distributing itself.
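For what it's worth, PyTorch exposes its intra-op thread pool directly, so this behaviour can be inspected and capped; a small sketch (the default noted in the comment is an expectation, not a guarantee):
import torch

print(torch.get_num_threads())  # typically defaults to the number of visible cores
torch.set_num_threads(4)        # cap it, e.g. when running several processes at once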
(The Numpy BLAS backend can be retrieved with np.__config__.show(). My local Numpy uses OpenBLAS; the Google cluster uses MKL (Conda installation). I can't find an equivalent command to check the BLAS backend of PyTorch, but I assume it uses the same.)
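For reference, a minimal backend check; the torch.__config__.show() call is an assumption on my side about newer PyTorch builds and may not exist in older versions:
import numpy as np
import torch

np.__config__.show()  # prints the BLAS/LAPACK backend Numpy was built against

# Assumption: newer PyTorch builds expose a similar build summary as a string
print(torch.__config__.show())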
In general, the conclusion seems to be that both Numpy and PyTorch already take care of distributing matrix multiplication across cores (as long as all CPUs are locally visible, i.e. no cluster/multi-server setting). Therefore, if most of your script is matrix multiplication, there is less reason than (at least I) expected to distribute the script yourself.
However, not all of my code is matrix multiplication, so in theory I should still be able to get a speed-up from parallel processes. I wrote a new test with roughly 50/50 single-threaded ("linear") and matrix multiplication code:
(speed_test2.py)
import random
import time

import torch

# Part 1: pure-Python list building; runs on a single core ("linear" part)
now = time.time()
for i in range(12000):
    [random.random() for k in range(10000)]
print('Linear time', round(time.time() - now, 1))

# Part 2: matrix multiplications; internally multi-threaded by PyTorch/BLAS
now = time.time()
for j in range(350):
    torch.matmul(torch.rand(1000, 1000), torch.rand(1000, 1000))
print('Matrix time', round(time.time() - now, 1))
Running this on Google Cloud (8 vCPU):
- Single process gives Linear time 12.6, Matrix time 9.2. (CPU during first part 100%, second part 500%)
- Parallel processes (python3 speed_test2.py & python3 speed_test2.py) give Linear time 12.6, Matrix time 15.4 for both processes.
- Adding a third process gives Linear time ~12.7, Matrix time 25.2.
Conclusion: Although there are 8 vCPUs here, the PyTorch/matrix (second) part of the code actually gets slower with more than 2 concurrent processes. The linear part still benefits from parallelism, of course: its per-process time stays roughly constant, so total throughput keeps increasing with up to 8 parallel processes. I think this altogether explains why, in practice, Numpy/PyTorch code may not show that much improvement when you start multiple concurrent processes, and why it may not always be beneficial to naively start 8 processes when you see 8 vCPUs.
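A possible mitigation, sketched with assumed values (2 processes on 8 vCPUs), is to cap the per-process thread count so the matrix parts of concurrent processes don't oversubscribe the cores:
import os

# Assumed example values: 2 concurrent processes sharing 8 vCPUs
N_PROCESSES = 2
THREADS_PER_PROCESS = 8 // N_PROCESSES

# Must be set before numpy/torch are imported to take effect
os.environ["OMP_NUM_THREADS"] = str(THREADS_PER_PROCESS)
os.environ["MKL_NUM_THREADS"] = str(THREADS_PER_PROCESS)

import torch
torch.set_num_threads(THREADS_PER_PROCESS)
Please correct me if I am wrong somewhere here.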