So a friend noticed something curious about numpy. Here is a minimal example that runs the same script first serially, then as two instances in parallel, each in its own process:
#!/bin/bash
# This is runner.sh
fl=$(mktemp /tmp/test_XXXXX.py)
trap "rm -fv '$fl'" EXIT
cat - > "$fl" <<'EndOfHereDoc'
#!/usr/bin/env python
import datetime
import sys

import numpy as np

if __name__ == '__main__':
    if len(sys.argv) > 1:
        print(sys.argv[1] + ' start: ' + str(datetime.datetime.now()))
    cube_size = 100
    cube = np.zeros((cube_size, cube_size, cube_size))
    cube_ones = np.ones((cube_size, cube_size, cube_size))
    for x in range(10000):
        np.add(cube_ones, cube, out=cube)
    if len(sys.argv) > 1:
        print(sys.argv[1] + ' end: ' + str(datetime.datetime.now()))
EndOfHereDoc
echo "Serial"
time python "$fl" 0
echo
echo "Parallel"
time python "$fl" 1 &
time python "$fl" 2 &
wait
rm -fv "$fl"
trap '' EXIT
The output of which is:
$ runner.sh
Serial
0 start: 2018-09-19 15:46:52.540881
0 end: 2018-09-19 15:47:04.592280
real 0m12,105s
user 0m12,084s
sys 0m0,020s
Parallel
1 start: 2018-09-19 15:47:04.665260
2 start: 2018-09-19 15:47:04.780635
2 end: 2018-09-19 15:47:27.053261
real 0m22,480s
user 0m22,448s
sys 0m0,128s
1 end: 2018-09-19 15:47:27.097312
real 0m22,505s
user 0m22,409s
sys 0m0,040s
removed '/tmp/test_zWN0V.py'
No speedup. It is as if the processes were run one after the other. I assume numpy is using some resource exclusively and the other process waits for that resource to be freed. But what exactly is going on here? The GIL should only be an issue with multi-threading, not with multiple processes, right? I find it especially weird that process 2 is not simply waiting for process 1 to finish. Instead, BOTH processes take ~22s to finish. I'd expect one to get the resource and finish in half the time, while the other waits until the first releases it and takes an additional ~12s.
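One thing I wondered about is hidden threading inside whatever BLAS backend numpy is linked against. A sketch of how one could rule that out, assuming one of the common backends (which variable actually takes effect depends on the build):

import os

# Pin the common BLAS thread pools to a single thread *before* importing
# numpy; which variable applies depends on whether numpy was linked
# against OpenBLAS, MKL, or a generic OpenMP build.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np  # the limits only apply if set before this import

If the slowdown persists with these pinned, BLAS-level threading can be ruled out as the shared resource.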
Note that this also occurs when running the Python code with Python's own multiprocessing module in a Pool.
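For reference, this is roughly what I mean (a minimal sketch; the worker repeats the loop from the script above):

import multiprocessing

import numpy as np

def worker(_):
    # The same memory-heavy loop as in the heredoc script above.
    cube_size = 100
    cube = np.zeros((cube_size, cube_size, cube_size))
    cube_ones = np.ones((cube_size, cube_size, cube_size))
    for x in range(10000):
        np.add(cube_ones, cube, out=cube)

if __name__ == '__main__':
    # Two workers in separate processes show the same lack of speedup.
    with multiprocessing.Pool(processes=2) as pool:
        pool.map(worker, range(2))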
It does not, however, occur if you do something that doesn't involve some specific numpy functions, like:
cube_size = 25
cube = [0 for i in range(cube_size**3)]
for x in range(10000):
    cube = [value + 1 for value in cube]
Edit:
I have a real 4-core CPU. I kept hyperthreading in mind; it's not the issue here. During the single-process part, one CPU is at 100% and the rest are idle; during the two-process part, two are at 100% and the rest are idle (as per htop). I understand that numpy calls into BLAS/LAPACK libraries such as ATLAS in the background, which are not Python (in fact pure C or Fortran). These might utilize parallel techniques. My question here is: why doesn't that show up in the CPU utilization?
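In case it's relevant: the BLAS/LAPACK implementation a given numpy build is linked against can be inspected like this (output varies by installation):

import numpy as np

# Prints the BLAS/LAPACK build configuration of this numpy installation
# (e.g. OpenBLAS, MKL, ATLAS).
np.show_config()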