
Update on 7/19/2017

Found the solution here: https://stackoverflow.com/a/10021912/5729266

A quick conclusion if you don't want to read to the end.

The inconsistency of random numbers in my previous code was caused by thread-unsafety: the random module is effectively a single global generator, so even though each thread worked with its own instance of my class, all the threads were sharing the same underlying random state.

To solve the problem, either guard the shared generator with a thread lock or give each thread its own independent random.Random() instance, as described in the link above. See the test code below.

import threading
import random

class do_threads:

    def __init__(self):
        # Using the random module directly is thread-unsafe
        # self.random = random

        # Instead, create a local, independent random instance
        self.random = random.Random()

    def __call__(self, n):
        self.n = n
        self.run_thread()

    def get_balance(self):
        self.random.seed(self.n)
        return self.random.uniform(0, 1)

    def run_thread(self):
        total = []
        for i in range(100000):
            total.append(self.get_balance())
        print(sum(total) / 100000)

a = do_threads()
b = do_threads()

t1 = threading.Thread(target=a, args=(5,))
t2 = threading.Thread(target=b, args=(8,))
t1.start()
t2.start()
t1.join()
t2.join()
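For completeness, the lock-based alternative mentioned above can be sketched like this (a minimal sketch, not the code I ended up using; the function names here are made up for illustration):

```python
import threading
import random

# Alternative: keep the shared module-level generator, but serialize
# access with a lock so one thread's seed()/uniform() pair cannot
# interleave with another thread's.
rng_lock = threading.Lock()

def get_balance(seed):
    with rng_lock:
        random.seed(seed)
        return random.uniform(0, 1)

def run_thread(seed, results):
    total = [get_balance(seed) for _ in range(100000)]
    results[seed] = sum(total) / 100000

results = {}
t1 = threading.Thread(target=run_thread, args=(5, results))
t2 = threading.Thread(target=run_thread, args=(8, results))
t1.start()
t2.start()
t1.join()
t2.join()
print(results)
```

Both approaches make the seed/draw pair atomic per thread; the lock keeps a single shared generator, while per-instance Random() objects avoid the contention entirely.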

Old post:

In my Python program I need to run N subprocesses using a multiprocessing Pool. Every subprocess spawns M threads, each of which needs to generate a hash code for each ID in the 'ID' column of a dataframe.

The hash codes need to follow a uniform(0, 1) distribution. To do this, I used the ID as the seed (random.seed(ID)) to set the random state and then produced a random key with random.uniform(0, 1). But there was about a 0.01% chance that an ID got different random numbers. For example, an ID '200300' appears 10000 times across all these threads/subprocesses, but 9999 times it gets one random key and 1 time it gets another.

So, my question is: does random.seed(seed) always generate the same sequence in parallel programs? If not, how can I fix the random state so that random.uniform(0, 1) returns the same number given the same ID? I am also open to other methods that can hash an ID into a random variable with a uniform(0, 1) distribution.

Just note that I want to use processes and threads for my work, and I cannot concatenate these dataframes during the program to generate all the random keys at once.

I tried using multiprocessing.Manager to share the random state, importing random in the parent process, and passing random_generator() as an instance or object from the parent process to the child environment. But nothing worked as expected.

Here is a simple version of my code:

# mythreads.py
from queue import Queue
from threading import Thread

from foo import Foo

class TaskWorker(Thread):
    def __init__(self, queue):
        Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            foo_cls, task = self.queue.get()  # don't shadow the Foo class
            foo_cls(task).generate_data(df)
            self.queue.task_done()  # required for queue.join() to return

def mythreads(sub_list):
    queue = Queue()
    for x in range(10):
        worker = TaskWorker(queue)
        worker.daemon = True
        worker.start()
    for task in sub_list:
        queue.put((Foo, task))
    queue.join()

# foo.py
import random

class Foo:
    def __init__(self, task):
        ...

    def random_generator(self, e):
        random.seed(e)
        return random.uniform(0, 1)

    def generate_data(self, df):
        df['RK'] = df['ID'].apply(self.random_generator)
        ...

# main.py
from multiprocessing.pool import Pool
from mythreads import mythreads

if __name__ == '__main__':
    with Pool(N) as p:
        p.map(mythreads, list_of_sublists)

Note: I use Python 3.6

Lena.C
  • Why do you want repeatable random numbers in the first place? I question the usefulness of deterministic random number generating as you might just as well hash your ID and be done with it. – zwer Apr 25 '18 at 19:10
  • What you want to do is: generate N seeds (e.g., by just pulling them from the RNG) before starting the pool (or doing anything else nondeterministic), then pass one of those seeds (in order) to either the pool processes, or the tasks, or the task chunks, so they can seed with it. – abarnert Apr 25 '18 at 19:15
  • I’ve written an answer on one of these pretty recently; once I get back to my computer I’ll search for it and see if it’s a useful dup. – abarnert Apr 25 '18 at 19:15
  • Don't use a random number generator if what you want isn't really random. Consider using a UUID instead. For example, generate one UUID with `uuid.uuid4()`, and pass that as an argument to each new `TaskWorker`. Then `df['RK'] = uuid.uuid3(self.base_uuid, df['ID'])`. – chepner Apr 25 '18 at 19:16
  • See https://tools.ietf.org/html/rfc4122.html#section-4.3 – chepner Apr 25 '18 at 19:22
  • Forgot to mention one thing in my post and just added it. Basically I have to assign a key with distribution of uniform(0,1) to an id. So that same ids will have same key. – Lena.C Apr 25 '18 at 19:54

2 Answers


Summary

Q. Does random.seed(seed) generate same sequence in parallel programs all the time?

A. Yes.

The random number generator is guaranteed to reproduce the same series of random values given the same starting seed.

One other thought: Use random.random() instead of random.uniform(0, 1). Both give the same range of random variables, but the former is both faster and more idiomatic.
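A quick check of that claim (not part of the original answer; the timing numbers will vary by machine):

```python
import random
import timeit

# uniform(a, b) is implemented as a + (b - a) * random(), so with
# (0, 1) it reduces to 0 + 1.0 * random() and yields the same value:
random.seed(42)
a = random.random()
random.seed(42)
b = random.uniform(0, 1)
assert a == b

# random() skips the argument arithmetic, so it is a bit faster:
print('random()      :', timeit.timeit(random.random, number=100000))
print('uniform(0, 1) :', timeit.timeit(lambda: random.uniform(0, 1), number=100000))
```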

Example

Demonstration of separate processes running different generators starting with the same seed:

from multiprocessing.pool import Pool
from pprint import pprint
import random

def make_seq(identifier):
    random.seed(8675309)
    seq = [random.random() for i in range(4)]
    return identifier, seq

p = Pool(10)
pprint(list(p.map(make_seq, range(10))), width=100)

Output:

[(0, [0.40224696110279223, 0.5102471779215914, 0.6637431122665531, 0.8607166923395507]),
 (1, [0.40224696110279223, 0.5102471779215914, 0.6637431122665531, 0.8607166923395507]),
 (2, [0.40224696110279223, 0.5102471779215914, 0.6637431122665531, 0.8607166923395507]),
 (3, [0.40224696110279223, 0.5102471779215914, 0.6637431122665531, 0.8607166923395507]),
 (4, [0.40224696110279223, 0.5102471779215914, 0.6637431122665531, 0.8607166923395507]),
 (5, [0.40224696110279223, 0.5102471779215914, 0.6637431122665531, 0.8607166923395507]),
 (6, [0.40224696110279223, 0.5102471779215914, 0.6637431122665531, 0.8607166923395507]),
 (7, [0.40224696110279223, 0.5102471779215914, 0.6637431122665531, 0.8607166923395507]),
 (8, [0.40224696110279223, 0.5102471779215914, 0.6637431122665531, 0.8607166923395507]),
 (9, [0.40224696110279223, 0.5102471779215914, 0.6637431122665531, 0.8607166923395507])]

Note that all the processes generated the same values.

Raymond Hettinger
  • I wanted to agree with you because that was also what I initially thought. But I noticed in my final results that 70 out of 2600 unique IDs have an extra different random key with a small chance. For example, ID 2000000247 received 0.0545 from `random.seed(2000000247); random.random()` 99.9% of time but also received a random key 0.8412. I ran the code many many times, it seems that the inconsistency is also random. – Lena.C Apr 26 '18 at 05:06
  • @Lena.C Are you rounding? The output of random.random() should go out to 17 decimal places: 0.054533502347412166 . The odds of seeing that number again is 1 in 2**53. If the ID gets rounded to 0.0545, then yes, you can expect a lot of collisions. That isn't a problem with random number generation; that is just math. – Raymond Hettinger Apr 26 '18 at 06:01
  • I did not round but only pasted the first few digits for short in the comment. Sorry for the confusion. What I wanted to address was that with the same ID 2000000247 seeded by `random.seed()`, `random.random()` did not always return the same number. One number is 0.054533502347412166, another is 0.841232357249359. – Lena.C Apr 27 '18 at 17:28

It sounds like what you really want is not a random number, but a hash of the ID. Check out Hashing strings with Python.

With a hash, you get hash keys that are evenly distributed and of identical width, and the same ID will always translate to the same hash key. Hash keys will look random, and it will be difficult to deduce the original ID from a hash key. If security is an issue (if it needs to be really difficult to figure out the IDs from the keys), avoid MD5; otherwise MD5 should be fine.

>>> import hashlib
>>> print(hashlib.md5('This is a test'.encode('utf-8')).hexdigest())
ce114e4501d2f4e2dcea3e17b546f339
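If the key needs to land in [0, 1) as the OP asked in the comments, the hash can be scaled into that range. The helper below is a hypothetical sketch of one way to do that, not part of the original answer:

```python
import hashlib

def id_to_uniform(identifier):
    """Deterministically map an ID to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.md5(str(identifier).encode('utf-8')).hexdigest()
    # The first 13 hex digits give 52 bits, which fit losslessly in a float.
    return int(digest[:13], 16) / 16 ** 13

# The same ID always yields the same key, across threads and processes:
print(id_to_uniform(200300))
print(id_to_uniform(200300))
```

Because the mapping is a pure function of the ID, it is immune to the thread-safety and seeding issues entirely.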
Jeff Learman
  • Also look at *uuid* which is designed for this task: https://docs.python.org/3/library/uuid.html#module-uuid – Raymond Hettinger Apr 26 '18 at 04:02
  • My answer here doesn't conform to some of the OP's request, which changed after I posted this. The result isn't in the range [0,1), for example. – Jeff Learman Apr 26 '18 at 14:35