Update on 7/19/2017
Found the solution here: https://stackoverflow.com/a/10021912/5729266
A quick conclusion if you don't want to read to the end.
The inconsistent random numbers in my previous code were caused by a thread-safety issue: the random module is shared global state, even when each thread works with its own instance of a class that merely references the module.
To solve the problem, you have to either use a thread lock or create an independent random instance per object, as described in the link above. See the test code below.
import threading
import random

class do_threads:
    def __init__(self):
        # Using the random module directly is thread-unsafe:
        # self.random = random
        # Instead of using the module, create a private random instance.
        self.random = random.Random()

    def __call__(self, n):
        self.n = n
        self.run_thread()

    def get_balance(self):
        # Seed and draw on this object's own generator, so other
        # threads cannot interfere between the two calls.
        self.random.seed(self.n)
        return self.random.uniform(0, 1)

    def run_thread(self):
        total = []
        for i in range(100000):
            total.append(self.get_balance())
        print(sum(total) / 100000)

a = do_threads()
b = do_threads()
t1 = threading.Thread(target=a, args=(5,))
t2 = threading.Thread(target=b, args=(8,))
t1.start()
t2.start()
t1.join()
t2.join()
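For reference, the thread-lock variant mentioned above would look roughly like this (a minimal sketch of my own, not code from the linked answer):

import threading
import random

# One module-level lock guarding the shared `random` state.
seed_lock = threading.Lock()

def get_balance(n):
    # Hold the lock so no other thread can re-seed the shared
    # generator between seed() and uniform().
    with seed_lock:
        random.seed(n)
        return random.uniform(0, 1)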
Old post:
In my Python program I need to run N subprocesses using multiprocessing.pool.Pool. Every subprocess spawns M threads, each of which needs to generate a hash code for the IDs in column 'ID' of a dataframe.
The hash codes need to follow a uniform(0, 1) distribution. To achieve this, I used the ID as a seed (random.seed(ID)) to set the random state, and then produced a random key with random.uniform(0, 1). But there was about a 0.01% chance that an ID got different random numbers. For example, an ID '200300' appears 10000 times across all these threads/subprocesses, but 9999 times it gets one random key and 1 time it gets another.
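With hindsight from the update above, the interleaving can be reproduced in a few lines; this is my own sketch (thread count, seeds, and loop size are arbitrary), not code from the original program:

import threading
import random

def worker(seed, out):
    # Repeatedly seed-and-draw on the shared module-level generator.
    for _ in range(100000):
        random.seed(seed)
        # A context switch here lets the other thread re-seed the
        # shared state, so uniform() occasionally draws from the
        # wrong sequence.
        out.add(random.uniform(0, 1))

seen_5, seen_8 = set(), set()
t1 = threading.Thread(target=worker, args=(5, seen_5))
t2 = threading.Thread(target=worker, args=(8, seen_8))
t1.start()
t2.start()
t1.join()
t2.join()
# Single-threaded, each set would hold exactly one value; here each
# set occasionally contains more.
print(len(seen_5), len(seen_8))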
So, my question is: does random.seed(seed) always generate the same sequence in parallel programs? If not, how can I fix the random state so that random.uniform(0, 1) returns the same number for the same ID? I am also open to other methods that can hash an ID into a random variable with a uniform(0, 1) distribution.
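One such method, sketched here only as an idea I have not benchmarked (the function name id_to_uniform is mine): hash the ID with hashlib and map the digest onto [0, 1). This is deterministic and touches no shared random state at all:

import hashlib

def id_to_uniform(id_value):
    # Hash the ID deterministically; the digest does not depend on
    # any shared random state, so threads/processes cannot interfere.
    digest = hashlib.md5(str(id_value).encode()).digest()
    # Interpret the first 8 bytes as an integer and scale to [0, 1).
    return int.from_bytes(digest[:8], 'big') / 2**64

print(id_to_uniform('200300'))  # same value in every thread/process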
Just note that I need to use both processes and threads for my work, and I cannot concatenate these dataframes during the program to generate all the random keys at once.
I tried using multiprocessing.Manager to share the random state, importing random in the parent process, and passing random_generator() as an instance or object from the parent process to the child environment. But things did not work as expected.
Here is a simple version of my code:
# mythreads.py
from queue import Queue
from threading import Thread
from foo import Foo

class TaskWorker(Thread):
    def __init__(self, queue):
        Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            Foo, task = self.queue.get()
            Foo(task).generate_data(df)  # df comes from the surrounding context in the real code
            self.queue.task_done()

def mythreads(sub_list):
    queue = Queue()
    for x in range(10):
        worker = TaskWorker(queue)
        worker.daemon = True
        worker.start()
    for task in sub_list:
        queue.put((Foo, task))
    queue.join()
# foo.py
import random

class Foo:
    def __init__(self, task):
        ...

    def random_generator(self, e):
        random.seed(e)
        return random.uniform(0, 1)

    def generate_data(self, df):
        df['RK'] = df['ID'].apply(self.random_generator)
        ...
# main.py
from multiprocessing.pool import Pool
from mythreads import mythreads

with Pool(N) as p:
    p.map(mythreads, list_of_sublists)
Note: I use Python 3.6
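Given the update at the top, the smallest fix to the code above is to stop touching the module-level generator in random_generator and instead build a throwaway random.Random seeded with the ID. A sketch of the changed method, written standalone here for clarity:

import random

def random_generator(e):
    # A locally scoped Random seeded with the ID: no shared state,
    # so the same ID maps to the same key in every thread and
    # every subprocess.
    return random.Random(e).uniform(0, 1)

assert random_generator(200300) == random_generator(200300)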