
So a friend noticed something curious about numpy. Here is a minimal example that runs the same script, first serially, then as two instances in parallel, each in its own process:

#!/bin/bash
# This is runner.sh

fl=$(mktemp /tmp/test_XXXXX.py)
trap "rm -fv '$fl'" EXIT
cat - > "$fl" <<-'EndOfHereDoc'
#!/usr/bin/env python
import datetime
import numpy as np
import sys

if __name__ == '__main__':
    if len(sys.argv)>1: print(sys.argv[1] +' start: '+ str(datetime.datetime.now()))
    cube_size=100
    cube=np.zeros((cube_size,cube_size,cube_size))
    cube_ones=np.ones((cube_size,cube_size,cube_size))

    for x in range(10000):
        np.add(cube_ones,cube,out=cube)
    if len(sys.argv)>1: print(sys.argv[1] +' end: '+ str(datetime.datetime.now()))
EndOfHereDoc

echo "Serial"
time python "$fl" 0
echo

echo "Parallel"
time python "$fl" 1&
time python3 "$fl" 2&
wait

rm -fv "$fl"
trap '' EXIT

The output of which is:

$ runner.sh 
Serial
0 start: 2018-09-19 15:46:52.540881
0 end: 2018-09-19 15:47:04.592280

real    0m12,105s
user    0m12,084s
sys 0m0,020s

Parallel
1 start: 2018-09-19 15:47:04.665260
2 start: 2018-09-19 15:47:04.780635
2 end: 2018-09-19 15:47:27.053261

real    0m22,480s
user    0m22,448s
sys 0m0,128s
1 end: 2018-09-19 15:47:27.097312

real    0m22,505s
user    0m22,409s
sys 0m0,040s
removed '/tmp/test_zWN0V.py'

No speedup. It is as if the processes were run one after the other. I assume numpy is using a resource exclusively and the other process has to wait for that resource to be freed. But what exactly is going on here? The GIL should only be an issue with multi-threading, not with multiple processes, right? I find it especially weird that p2 is not simply waiting for p1 to finish. Instead, BOTH processes take ~22s to finish. I'd expect one to get the resource and finish in half the time, while the other waits until the first releases it and takes an additional ~12s.

Note that this also occurs when running the Python code with Python's own multiprocessing module in a Pool. It does not occur, however, if you do something that doesn't involve certain numpy functions, such as:

cube_size=25
cube=[0 for i in range(cube_size**3)]

for x in range(10000):
    cube = [ value + 1 for value in cube]

Edit:

I have a real 4-core CPU. I kept hyperthreading in mind; it's not the issue here. During the single-process part, one CPU is at 100% and the rest are idle. During the two-process part, two CPUs are at 100% and the rest are idle (as per htop). I understand that numpy uses ATLAS, LAPACK and BLAS libraries in the background, which are not Python (in fact pure C or Fortran) and might utilize parallel techniques. My question here is: why doesn't that show up in the CPU utilization?
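
To rule out library-level threading as the explanation, one could pin the usual BLAS thread pools to a single thread before numpy is imported. This is a minimal sketch; which (if any) of these environment variables actually takes effect depends on the BLAS build numpy links against, so treat them as best-effort guesses:

import datetime
import os

# Must be set before numpy is imported; which variable (if any) applies
# depends on the BLAS numpy was built against (OpenBLAS, MKL, ATLAS, ...).
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np

# Same workload as above; if the serial/parallel timings do not change,
# library threads are not what is eating the extra time.
cube_size = 100
cube = np.zeros((cube_size, cube_size, cube_size))
cube_ones = np.ones((cube_size, cube_size, cube_size))

start = datetime.datetime.now()
for _ in range(10000):
    np.add(cube_ones, cube, out=cube)
print('elapsed:', datetime.datetime.now() - start)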

con-f-use
  • Maybe you only have a single core CPU? Or insufficient RAM - try running with `/usr/bin/time -l python ...` and looking at the max working set. – Mark Setchell Sep 19 '18 at 14:05
  • Possible duplicate of [Multiprocessing.Pool makes Numpy matrix multiplication slower](https://stackoverflow.com/questions/15414027/multiprocessing-pool-makes-numpy-matrix-multiplication-slower) – duhaime Sep 19 '18 at 14:15
  • @MarkSetchell Look at the timing output. The parallel processes definitely ran in parallel. The issue here is numpy *is* already multi-threaded (or can be) and is good at multi-threading, despite the GIL (because most of what numpy does doesn't use the interpreter). – Dunes Sep 19 '18 at 14:45
  • @Dunes - Yes, you are right, thank you. – Mark Setchell Sep 19 '18 at 14:58
  • Just for the record, I have a four-core CPU. Btw., to prevent other people from fooling themselves: hyperthreading will show up as more CPU cores than you actually have and can productively use at 100% utilization. My follow-up question would be: if numpy already runs in parallel, why is only one core at 100% in the single-process example while the rest idle? – con-f-use Sep 21 '18 at 10:56
  • I may have misrepresented the situation. Numpy is parallelised in that it uses SSE/SSE2 intrinsics, but by itself it is not multi-threaded. However, it can and does use multi-threaded libraries where available. For instance, you should find that `np.dot(arr0, arr1)` is faster than `(arr0 * arr1).sum()`. This is because with `np.dot` numpy will try to use a library (BLAS in this case) that computes the dot product using threads. Numpy doesn't know about the threads as they are internal to the library. – Dunes Sep 21 '18 at 11:34

1 Answer


Numpy is not restricted by the GIL as much as core Python code is. This is because numpy only stores the array container as a Python object; the actual data itself is stored as "primitive" types defined in C. This is also why iterating over a numpy array is much slower than iterating over a Python list: the numpy array has to build a new Python object for each value it yields, whereas the Python list already holds Python objects.
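
A rough way to see that boxing overhead (a minimal sketch; absolute numbers depend on the machine, the ratio is what matters):

import timeit

import numpy as np

n = 10**6
arr = np.zeros(n)   # one million C doubles; iteration builds a new Python float per element
lst = [0.0] * n     # one million Python float objects that already exist

print("numpy array:", timeit.timeit(lambda: sum(x for x in arr), number=10))
print("python list:", timeit.timeit(lambda: sum(x for x in lst), number=10))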

As numpy is not hampered by the GIL, it is able to use threaded math libraries where available. That is to say, your parallel processes took longer to run because each process was already maxing out your machine and so both processes were competing for the same resources.

Take a look at the output of the following to see which of these libraries are available on your machine (be warned, it's quite verbose):

import numpy.distutils.system_info as sysinfo
sysinfo.show_all()
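
If that is too much output, `np.show_config()` should give a shorter summary of the same build information (a quick alternative; the exact output format varies between numpy versions):

import numpy as np
np.show_config()  # prints the BLAS/LAPACK libraries numpy was compiled against
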
Dunes
  • As you seem well informed about `numpy`, maybe you would care to take a look at this https://stackoverflow.com/a/52188236/2836621 Thanks if you have a minute, don't worry if not :-) – Mark Setchell Sep 19 '18 at 20:56
  • 1
    I add another answer. Not sure how much it really helps though. The question seems to involve much more than just numpy. – Dunes Sep 19 '18 at 22:39
  • Hmm... but in the example I posted, when I look up my cpu usage with `htop`, I get one core at 100% for the one process part and **two** cores at 100% each for the part where two processes are running. It should show up in `htop` if numpy was already running in parallel. – con-f-use Sep 21 '18 at 10:45
  • 1
    "Maxing out your machine" doesn't just mean CPU instructions per second. It can also mean RAM read/writes per second. Your arrays have 1,000,000 items (100^3). You have two of them and they have 64-bit floats by default. That's 16MB just for data of a single process. The L3 cache of an i7 CPU is 8MB and this is shared across all processors. As such, both processes will frequently have to load and write data to RAM, and may have to wait if the other process is doing a read/write. – Dunes Sep 21 '18 at 11:12