
I have a Python script that concurrently processes numpy arrays and images in a random way. To get proper randomness inside the spawned processes, I pass a random seed from the main process to the workers so that each one can seed itself.

When I use maxtasksperchild for the Pool, my script hangs after running Pool.map a number of times.

The following is a minimal snippet that reproduces the problem:

# This code stops after multiprocessing.Pool workers are replaced one single time.
# They are replaced due to the maxtasksperchild parameter to Pool
from multiprocessing import Pool
import numpy as np

def worker(n):
    # Removing np.random.seed solves the issue
    np.random.seed(1)  # any seed value
    return 1234  # trivial return value

# Removing maxtasksperchild solves the issue
ppool = Pool(20, maxtasksperchild=5)
i = 0
while True:
    i += 1
    # Removing np.random.randint(10) or taking it out of the loop solves the issue
    rand = np.random.randint(10)
    l = [3]  # trivial input to ppool.map
    result = ppool.map(worker, l)
    print i, result[0]

This is the output:

1 1234
2 1234
3 1234
.
.
.
99 1234
100 1234 # at this point workers should've reached maxtasksperchild tasks
101 1234
102 1234
103 1234
104 1234
105 1234
106 1234
107 1234
108 1234
109 1234
110 1234

then hangs indefinitely.

I could potentially replace numpy.random with Python's own random module and get around the problem. However, in my actual application the worker executes user code (given as an argument to the worker) which I have no control over, and I would like to allow that user code to use numpy.random functions. So I intentionally want to seed the global random generator (for each process independently).
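For reference, a simplified sketch of the seeding pattern I have in mind (the initializer approach and all names here are illustrative, not my exact code):

import os
from multiprocessing import Pool
import numpy as np

def init_worker(base_seed):
    # Seed the *global* numpy generator, differently in each process,
    # so user code calling np.random.* gets proper randomness
    np.random.seed(base_seed + os.getpid())

def run_user_code(n):
    # Stand-in for user code that may call np.random.* functions
    return np.random.randint(1000)

pool = Pool(4, initializer=init_worker, initargs=(1234,))
print(pool.map(run_user_code, range(4)))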

This was tested with Python 2.7.10 and numpy 1.11.0, 1.12.0 and 1.13.0, on Ubuntu and OS X.

MohamedEzz
  • [Can't reproduce](http://ideone.com/K2o7l1) on Ideone with a 2-process pool. (Ideone wouldn't let me use 20.) Do your results depend on the pool size? – user2357112 Jun 12 '17 at 23:34
  • Maybe. I just retried this and it hangs for a Pool with 7+ workers, but at different times in each run. So it looks like a race condition that appears more prominently as the number of workers increases – MohamedEzz Jun 12 '17 at 23:41

3 Answers


It turns out this comes from a buggy interaction between Python's threading.Lock and multiprocessing.

np.random.seed and most np.random.* functions use a threading.Lock to ensure thread-safety. An np.random.* function generates a random number and then updates the seed (shared across threads), which is why the lock is needed. See np.random.seed and cont0_array (used by np.random.random() and others).

Now how does this cause a problem in the above snippet?

In a nutshell, the snippet hangs because the threading.Lock state is inherited by the child when forking. So when a child is forked at the same moment the lock is held in the parent (acquired by np.random.randint(10)), the child starts with the lock already taken and no thread to release it, and deadlocks (at np.random.seed).
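The same mechanism can be reproduced without numpy at all. Here is a minimal illustration (my own sketch, not from the linked issue) that hangs by design: a helper thread holds a plain threading.Lock while the main thread forks, so the child inherits the lock in its acquired state with no thread left to release it:

import os
import threading
import time

lock = threading.Lock()

def hold_lock():
    with lock:
        time.sleep(2)  # hold the lock while the fork happens

threading.Thread(target=hold_lock).start()
time.sleep(0.5)  # make sure the helper thread has acquired the lock

pid = os.fork()
if pid == 0:
    # Child: the lock was inherited in its *acquired* state, but the
    # thread that held it does not exist here, so this blocks forever
    lock.acquire()
    os._exit(0)  # never reached
else:
    os.waitpid(pid, 0)  # the parent waits forever for the child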

@njsmith explains it in this GitHub issue: https://github.com/numpy/numpy/issues/9248#issuecomment-308054786

multiprocessing.Pool spawns a background thread to manage workers: https://github.com/python/cpython/blob/aefa7ebf0ff0f73feee7ab24f4cdcb2014d83ee5/Lib/multiprocessing/pool.py#L170-L173

It loops in the background calling _maintain_pool: https://github.com/python/cpython/blob/aefa7ebf0ff0f73feee7ab24f4cdcb2014d83ee5/Lib/multiprocessing/pool.py#L366

If a worker exits, for example due to a maxtasksperchild limit, then _maintain_pool calls _repopulate_pool: https://github.com/python/cpython/blob/aefa7ebf0ff0f73feee7ab24f4cdcb2014d83ee5/Lib/multiprocessing/pool.py#L240

And then _repopulate_pool forks some new workers, still in this background thread: https://github.com/python/cpython/blob/aefa7ebf0ff0f73feee7ab24f4cdcb2014d83ee5/Lib/multiprocessing/pool.py#L224

So what's happening is that eventually you get unlucky, and at the same moment that your main thread is calling some np.random function and holding the lock, multiprocessing decides to fork a child, which starts out with the np.random lock already held but the thread that was holding it is gone. Then the child tries to call into np.random, which requires taking the lock, and so the child deadlocks.

The simple workaround here is to not use fork with multiprocessing. If you use the spawn or forkserver start methods then this should go away.

For a proper fix.... ughhh. I guess we.. need to register a pthread_atfork pre-fork handler that takes the np.random lock before fork and then releases it afterwards? And really I guess we need to do this for every lock in numpy, which requires something like keeping a weakset of every RandomState object, and _FFTCache also appears to have a lock...

(On the plus side, this would also give us an opportunity to reinitialize the global random state in the child, which we really should be doing in cases where the user hasn't explicitly seeded it.)
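Concretely, here is a sketch (mine, not from the issue) of the question's repro with the start method switched; this requires Python 3.4+ for multiprocessing.get_context, since Python 2 has no start-method API:

import numpy as np

def worker(n):
    np.random.seed(1)  # safe now: no lock state is inherited
    return 1234

if __name__ == '__main__':
    from multiprocessing import get_context

    # 'spawn' starts every (replacement) worker from a fresh interpreter
    # instead of forking, so it cannot inherit an acquired np.random lock
    ctx = get_context('spawn')
    ppool = ctx.Pool(20, maxtasksperchild=5)
    i = 0
    while True:
        i += 1
        rand = np.random.randint(10)
        print(i, ppool.map(worker, [3])[0])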

MohamedEzz
  • I figured it'd be one of those ugly thread/fork-without-exec issues. Fork-without-exec is pretty fragile in general. – user2357112 Jun 13 '17 at 17:20

Using numpy.random.seed is not thread-safe. numpy.random.seed changes the value of the seed globally, while, as far as I understand, you are trying to change the seed locally.

See the docs.

If indeed what you are trying to achieve is having the generator seeded at the start of each worker, the following is a solution:

def worker(n):
    # Use a per-process RandomState instead of reseeding the global generator
    randgen = np.random.RandomState(45678)  # RandomState, not seed!
    # ...do something with randgen...
    return 1234  # trivial return value
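For instance (my addition, mirroring the question's use of randint), the RandomState methods correspond one-to-one to the np.random.* functions:

randgen = np.random.RandomState(45678)
print(randgen.randint(10))      # counterpart of np.random.randint(10)
print(randgen.random_sample())  # counterpart of np.random.random_sample()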
matteo
  • Thanks. However `np.random.RandomState(1)` doesn't seed the global random generator. Also, could you please point me to where the docs talk about thread-safety of `numpy.random.seed`? I couldn't find it – MohamedEzz Jun 11 '17 at 22:21
  • If by thread safety you mean that each process will generate the same random numbers... yes, that is why I pass a random seed from the main process to each worker – MohamedEzz Jun 11 '17 at 22:48
  • It depends what you want to achieve. If it's having a new generator in each worker, seeded with 45678, this works. As far as I understand, `randgen` is now the equivalent of `numpy.random`, seeded with 45678. – matteo Jun 12 '17 at 09:32
  • Again, as far as I understand, `numpy.seed` changes the global seed every time it's executed. I'm not sure, but I think this means that during the execution of one thread you are changing the seed that will influence the other threads. – matteo Jun 12 '17 at 09:34
  • numpy.seed changes the seed globally; "globally" in the sense that all subsequent calls to numpy.random functions will depend on that seed. But each process is completely independent of the others by virtue of process management in Unix. Each process has its own copy of memory and lives in its own world, unless sockets/queues/etc. are explicitly used for IPC – MohamedEzz Jun 12 '17 at 12:08
  • My knowledge of numpy's inner workings is limited, sorry about that :). However, for the part of the question "So I intentionally want to seed the global random generator"... my solution does work in your example. Have you tried it in a more general example? – matteo Jun 12 '17 at 12:25
  • You mean using `randgen` solves it? Is this a third-party library? As mentioned, I have user code running inside the workers, so I need numpy specifically to be seeded – MohamedEzz Jun 12 '17 at 13:24
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/146429/discussion-between-matteo-bachetti-and-mohamedezz). – matteo Jun 12 '17 at 14:19

Making this a full-fledged answer since it doesn't fit in a comment.

After playing around a bit, something here smells like a numpy.random bug. I was able to reproduce the freezing bug, and in addition there were some other weird things that shouldn't happen, like manually seeding the generator not working.

import multiprocessing
import numpy as np

def rand_seed(rand, i):
    print(i)
    np.random.seed(i)
    print(i)
    print(rand())

def test1():
    with multiprocessing.Pool() as pool:
        [pool.apply_async(rand_seed, (np.random.random_sample, i)).get()
         for i in range(5)]

test1()

has output

0
0
0.3205032737431185
1
1
0.3205032737431185
2
2
0.3205032737431185
3
3
0.3205032737431185
4
4
0.3205032737431185

On the other hand, not passing np.random.random_sample as an argument works just fine.

def rand_seed2(i):
    print(i)
    np.random.seed(i)
    print(i)
    print(np.random.random_sample())

def test2():
    with multiprocessing.Pool() as pool:
        [pool.apply_async(rand_seed2, (i,)).get()
         for i in range(5)]

test2()

has output

0
0
0.5488135039273248
1
1
0.417022004702574
2
2
0.43599490214200376
3
3
0.5507979025745755
4
4
0.9670298390136767

This suggests some serious tomfoolery is going on behind the curtains. Not sure what else to say about it though....

Basically it seems like numpy.random.seed modifies not only the "seed state" variable, but also the random_sample function itself.
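My guess at the mechanism (a sketch, not verified against numpy's internals): np.random.random_sample is a bound method of numpy's hidden global RandomState, and pickling it, which Pool does when shipping arguments to workers, copies that generator's state along with it. The worker then draws from the copy, while np.random.seed reseeds only the worker's own global generator. The same effect shows up without multiprocessing:

import pickle
import numpy as np

# Pickling the bound method also pickles its RandomState, state included
rand = pickle.loads(pickle.dumps(np.random.random_sample))

np.random.seed(0)                 # reseeds only the global generator
print(rand())                     # drawn from the pickled copy, unaffected
print(np.random.random_sample())  # drawn from the freshly seeded generator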

  • I'm not sure why you pass random_sample to the workers, but it seems to pass the function along with its state (RandomState). I'm surprised that passing `random_sample` worked at all, since it's an instance method, which isn't normally picklable and thus not passable through a queue (used by Pool). But since this is a Cython class, things could be different. `random_sample` is defined here: https://github.com/numpy/numpy/blob/7cfec2403486456b52b525eccf7541e1562d9ab3/numpy/random/mtrand/mtrand.pyx#L814 It's an interesting problem, but not quite related to my issue, I believe – MohamedEzz Jun 12 '17 at 23:28