Thanks for the reply @TomFenech, indeed I should have included the CPU usage information:
- Local (4 vCPUs): single call ~390%, double call ~190-200% each
- Google cluster (8 vCPUs): single call ~400%, double call ~400% each (as expected)
Conclusion of the toy example: You are right. When I run htop, I actually see 4 processes per started job, not 1. So each job is internally distributing itself across cores. I think this is because BLAS/MKL parallelizes the (matrix) multiplication over threads.
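A quick way to verify this is to cap the BLAS thread pool and watch the timing change; a minimal sketch, assuming OpenBLAS/MKL honor the standard threading environment variables (which must be set before Numpy is imported):
import os
os.environ["OMP_NUM_THREADS"] = "1"   # honored by OpenBLAS
os.environ["MKL_NUM_THREADS"] = "1"   # honored by MKL

import time
import numpy as np

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)
now = time.time()
a @ b
print('Capped matmul time', round(time.time() - now, 1))
# Without the env vars above, the same multiplication fans out over all
# visible cores, which is why htop shows several threads per job.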
Continuation for the true job: The toy example above was, in hindsight, a special case and not a perfect stand-in for my true script. My true (machine learning) script only partially relies on Numpy (not for matrix multiplication); most of the heavy computation is performed in PyTorch. When I run my script locally (4 vCPUs), it uses ~220% CPU. When I run the same script on the Google Cloud cluster (8 vCPUs), it surprisingly reaches up to ~700% (htop indeed shows 7-8 processes). So PyTorch seems to be doing an even better job of distributing itself.
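For what it's worth, PyTorch exposes its intra-op thread pool directly, so this behaviour can be inspected and capped; a small sketch (the default noted in the comment is an expectation, not a guarantee):
import torch

print(torch.get_num_threads())  # typically defaults to the number of visible cores
torch.set_num_threads(4)        # cap it, e.g. when running several processes at once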
(The Numpy BLAS backend can be retrieved with np.__config__.show(). My local Numpy uses OpenBLAS; the Google cluster uses MKL (Conda installation). I can't find an equivalent command to check the BLAS backend of PyTorch, but I assume it uses the same.)
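For reference, a minimal backend check; the torch.__config__.show() call is an assumption on my side about newer PyTorch builds and may not exist in older versions:
import numpy as np
import torch

np.__config__.show()  # prints the BLAS/LAPACK backend Numpy was built against

# Assumption: newer PyTorch builds expose a similar build summary as a string
print(torch.__config__.show())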
In general, the conclusion seems to be that both Numpy and PyTorch already take care of distributing matrix multiplication across cores (as long as all CPUs are locally visible, i.e. no cluster/multi-server setting). Therefore, if most of your script is matrix multiplication, there is less reason than (at least I) expected to distribute the script yourself.
However, not all of my code is matrix multiplication, so in theory I should still be able to get a speed-up from parallel processes. I wrote a new test with roughly 50/50 single-threaded ("linear") and matrix multiplication code:
(speed_test2.py)
import random
import time

import torch

# Part 1: pure-Python list building; runs on a single core ("linear" part)
now = time.time()
for i in range(12000):
    [random.random() for k in range(10000)]
print('Linear time', round(time.time() - now, 1))

# Part 2: matrix multiplications; internally multi-threaded by PyTorch/BLAS
now = time.time()
for j in range(350):
    torch.matmul(torch.rand(1000, 1000), torch.rand(1000, 1000))
print('Matrix time', round(time.time() - now, 1))
Running this on Google Cloud (8 vCPU):
- Single process gives Linear time 12.6, Matrix time 9.2. (CPU during first part 100%, second part 500%)
- Parallel processes (python3 speed_test2.py & python3 speed_test2.py) give Linear time 12.6, Matrix time 15.4 for both processes.
- Adding a third process gives Linear time ~12.7, Matrix time 25.2.
Conclusion: Although there are 8 vCPUs here, the PyTorch/matrix (second) part of the code actually gets slower with more than 2 concurrent processes. The linear part still benefits from parallelism, of course: its per-process time stays roughly constant, so total throughput keeps increasing with up to 8 parallel processes. I think this altogether explains why, in practice, Numpy/PyTorch code may not show that much improvement when you start multiple concurrent processes, and why it may not always be beneficial to naively start 8 processes when you see 8 vCPUs.
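A possible mitigation, sketched with assumed values (2 processes on 8 vCPUs), is to cap the per-process thread count so the matrix parts of concurrent processes don't oversubscribe the cores:
import os

# Assumed example values: 2 concurrent processes sharing 8 vCPUs
N_PROCESSES = 2
THREADS_PER_PROCESS = 8 // N_PROCESSES

# Must be set before numpy/torch are imported to take effect
os.environ["OMP_NUM_THREADS"] = str(THREADS_PER_PROCESS)
os.environ["MKL_NUM_THREADS"] = str(THREADS_PER_PROCESS)

import torch
torch.set_num_threads(THREADS_PER_PROCESS)
Please correct me if I am wrong somewhere here.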