24

I have very simple cases where the work to be done can be broken up and distributed among workers. I tried a basic multiprocessing example from here:

import multiprocessing
import numpy as np
import time

def do_calculation(data):
    # sleep for a random number of seconds to simulate uneven workloads
    rand = np.random.randint(10)
    print(data, rand)
    time.sleep(rand)
    return data * 2

if __name__ == '__main__':
    pool_size = multiprocessing.cpu_count() * 2
    pool = multiprocessing.Pool(processes=pool_size)

    inputs = list(range(10))
    print('Input   :', inputs)

    pool_outputs = pool.map(do_calculation, inputs)
    print('Pool    :', pool_outputs)

The above program produces the following output:

Input   : [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
0 7
1 7
2 7
5 7
3 7
4 7
6 7
7 7
8 6
9 6
Pool    : [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

Why is the same random number getting printed? (I have 4 CPUs in my machine.) Is this the best/simplest way to proceed?

imsc
  • possible duplicate of [Using python multiprocessing with different random seed for each process](http://stackoverflow.com/questions/9209078/using-python-multiprocessing-with-different-random-seed-for-each-process) –  Sep 14 '15 at 15:46
  • Is there no way to set the random number for every process that might use random numbers? Say one uses the module random, numpy, scipy, tensorflow and who knows what else. Is the only way to make sure the process has a different random seed to go through each of these and manually set the state? – Charlie Parker Apr 05 '17 at 05:14

4 Answers

23

I think you'll need to re-seed the random number generator using `numpy.random.seed` in your `do_calculation` function.

My guess is that the random number generator (RNG) gets seeded when you import the module. Then, when you use multiprocessing, you fork the current process with the RNG already seeded, so all your processes share the same seed value for the RNG and will therefore generate the same sequence of numbers.

e.g.:

def do_calculation(data):
    np.random.seed()  # re-seed from fresh OS entropy in each worker
    rand = np.random.randint(10)
    print(data, rand)
    return data * 2
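
To make the diagnosis concrete, here is a minimal sketch (POSIX-only, since it calls os.fork directly; an illustration, not part of the original answer) showing that a forked child resumes from a copy of the parent's RNG state:

import os
import numpy as np

# the global RNG is seeded once, in the parent, when numpy.random is first used
print('before fork:', np.random.randint(10))

pid = os.fork()
# parent and child resume from the same copied state,
# so this draw prints the same value in both processes
print('pid', os.getpid(), 'draws', np.random.randint(10))
if pid:
    os.waitpid(pid, 0)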
mgilson
  • Can you show me how to put `seed` in `do_calculation`. If I put `seed` in `main` I still get similar output. – imsc Oct 16 '12 at 13:06
  • @imsc -- Sorry, I didn't read carefully enough. You want `np.random.seed` (not `random.seed`). I've updated accordingly. – mgilson Oct 16 '12 at 13:10
  • @imsc - Are you sure? I can reproduce your original behavior on my laptop (only 2 cores), but it gets better when I add `np.random.seed()`. Another thing that might be making this a little more cloudy is the `pool_size = multiprocessing.cpu_count() * 2`. Perhaps try just using `cpu_count()`. You don't really gain much using more than that anyway I wouldn't think... – mgilson Oct 16 '12 at 13:15
  • Thanks a lot. Previously I put the seed after calculating the random number. – imsc Oct 16 '12 at 13:59
  • @imsc -- Whoops, That'll do it :) – mgilson Oct 16 '12 at 14:00
  • Is there no way to set the random number for every process that might use random numbers? Say one uses the module random, numpy, scipy, tensorflow and who knows what else. Is the only way to make sure the process has a different random seed to go through each of these and manually set the state? – Charlie Parker Apr 05 '17 at 03:08
  • @CharlieParker -- Yep, that's pretty much the only way that I can see it happening. If you avoid importing the module until after you've forked the process, then you might be OK, but that seems to be fragile at best. – mgilson Apr 05 '17 at 05:11
  • @mgilson I want to share numpy random state of a parent process with a child process. I've tried using `Manager` but still no luck. Could you please take a look at my question [here](https://stackoverflow.com/questions/49372619/how-to-share-numpy-random-state-of-a-parent-process-with-child-processes) an see if you can offer a solution? I can still get different random numbers if I do `np.random.seed(None)` every time that I generate a random number, but this does not allow me to use the random state of the parent process, which is not what I want. Any help is greatly appreciated. – Amir Mar 20 '18 at 02:08
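
For the question in the comments about reseeding every library at once: a hedged sketch of a Pool initializer that reseeds both the stdlib random module and NumPy's legacy global generator in each worker (the function name reseed_all is illustrative; any other library with its own RNG state would need its own call here):

import random
import numpy as np
from multiprocessing import Pool

def reseed_all():
    # each worker pulls fresh OS entropy for every generator it uses
    random.seed()       # stdlib random
    np.random.seed()    # NumPy legacy global generator

def draw(_):
    return random.random(), np.random.random()

if __name__ == '__main__':
    with Pool(4, initializer=reseed_all) as pool:
        print(pool.map(draw, range(4)))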
2

This blog post provides an example of good and bad practice when using numpy.random with multiprocessing. The important thing is to understand when the seed of your pseudo-random number generator (PRNG) is created:

import numpy as np
import pprint
from multiprocessing import Pool

pp = pprint.PrettyPrinter()

def bad_practice(index):
    # draws from the global generator, whose already-seeded state
    # was copied into every forked worker
    return np.random.randint(0, 10, size=10)

def good_practice(index):
    # a fresh RandomState pulls new OS entropy on every call
    return np.random.RandomState().randint(0, 10, size=10)

if __name__ == '__main__':
    p = Pool(5)

    pp.pprint("Bad practice: ")
    pp.pprint(p.map(bad_practice, range(5)))
    pp.pprint("Good practice: ")
    pp.pprint(p.map(good_practice, range(5)))

output:

'Bad practice: '
[array([4, 2, 8, 0, 1, 1, 6, 1, 2, 9]),
 array([4, 2, 8, 0, 1, 1, 6, 1, 2, 9]),
 array([4, 2, 8, 0, 1, 1, 6, 1, 2, 9]),
 array([4, 2, 8, 0, 1, 1, 6, 1, 2, 9]),
 array([4, 2, 8, 0, 1, 1, 6, 1, 2, 9])]
'Good practice: '
[array([8, 9, 4, 5, 1, 0, 8, 1, 5, 4]),
 array([5, 1, 3, 3, 3, 0, 0, 1, 0, 8]),
 array([1, 9, 9, 9, 2, 9, 4, 3, 2, 1]),
 array([4, 3, 6, 2, 6, 1, 2, 9, 5, 2]),
 array([6, 3, 5, 9, 7, 1, 7, 4, 8, 5])]

In the good practice, a freshly seeded RandomState is created for every call in each worker process, while in the bad practice the global generator is seeded only once, when the numpy.random module is imported, and that already-seeded state is inherited by every forked worker.
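
One possible refinement (a sketch, not from the blog post): constructing a new RandomState on every call works but discards the stream each time; a single instance per worker, created in a Pool initializer, gives the same independence more cheaply:

import numpy as np
from multiprocessing import Pool

rng = None  # one RandomState per worker process

def init_rng():
    global rng
    rng = np.random.RandomState()  # seeded from fresh OS entropy, once per worker

def better_practice(index):
    return rng.randint(0, 10, size=10)

if __name__ == '__main__':
    with Pool(5, initializer=init_rng) as p:
        print(p.map(better_practice, range(5)))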

t_sic
1

If you just want the legacy np.random generators to be distinct, you can simply pass np.random.seed as the Pool's initializer:

from multiprocessing import Pool
import numpy as np

def foo(_):
    return np.random.random()

with Pool(initializer=np.random.seed) as pool:
    print(pool.map(foo, range(5)))

This will cause the random generator to be reseeded in each worker process by pulling in fresh entropy from the OS.

If you're running Python 3.7+, you might want to use os.register_at_fork instead:

from os import register_at_fork
import numpy as np

# reseed NumPy's global generator in every forked child,
# regardless of what triggered the fork
register_at_fork(after_in_child=np.random.seed)

with Pool() as pool:
    print(pool.map(foo, range(5)))

This has the advantage of working whether multiprocessing is doing the forking or not.

If you care about deterministically seeding worker processes, then you likely want to use a SeedSequence, as pointed out by @hasManyStupidQuestions. This also has the advantage of using the newer and faster RNGs.

NumPy issue 9650 has even more details.

Sam Mason
0

Here's what I use (requires NumPy 1.17 or newer, which introduced SeedSequence and default_rng):

import numpy as np
from multiprocessing import Pool

entropy = 42
seed_sequence = np.random.SeedSequence(entropy)

number_processes = 5

# spawn() derives independent, reproducible child seeds from the parent sequence
seeds = seed_sequence.spawn(number_processes)

def good_practice(seed):
    # each task builds its own Generator from its own child seed
    rng = np.random.default_rng(seed)
    return rng.integers(0, 10, size=10)

if __name__ == '__main__':
    pool = Pool(number_processes)
    print(pool.map(good_practice, seeds))

Output:

[array([4, 9, 5, 9, 2, 8, 3, 3, 5, 9]), 
 array([0, 4, 1, 0, 6, 5, 3, 1, 7, 9]), 
 array([7, 0, 7, 7, 1, 0, 1, 3, 9, 6]), 
 array([8, 7, 9, 9, 1, 7, 4, 0, 5, 2]), 
 array([9, 0, 8, 9, 3, 8, 6, 6, 7, 9])]

The NumPy documentation on this was actually fairly helpful.
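
One property worth noting (a quick sketch using the same entropy value as above; an illustration, not from the original answer): spawning is deterministic for a given entropy, so rebuilding the SeedSequence reproduces the same child seeds and hence the same streams:

import numpy as np

entropy = 42
first = np.random.SeedSequence(entropy).spawn(5)
second = np.random.SeedSequence(entropy).spawn(5)

# matching children seed identical generators, so the streams agree
for a, b in zip(first, second):
    assert (np.random.default_rng(a).integers(0, 10, size=10)
            == np.random.default_rng(b).integers(0, 10, size=10)).all()
print("reproducible across runs")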