5

I want to see which random number generator package is faster in my neural network.

I am currently modifying code from GitHub in which both the numpy.random and random packages are used to generate random integers, random choices, random samples, etc.

The reason I am changing this code is that, for research purposes, I would like to set a global seed so I can compare accuracy performance for different settings of hyperparameters. The problem is that at the moment I have to set two global seeds, one for the random package and one for the numpy package. Ideally, I would like to set only one seed, as draws from two separate random number generator sequences might become correlated more quickly.

However, I do not know which package will perform better (in terms of speed): numpy or random. So I would like to find seeds for both packages that correspond to exactly the same Mersenne Twister sequence. That way, the draws for both models are the same, and therefore so is the number of iterations in each gradient descent step, leaving the choice of package as the only cause of any difference in speed.

I could not find any documentation on pairs of seeds that produce the same random number sequence in both packages, and trying out all kinds of combinations seems a bit cumbersome.

I have tried the following:

import random
import numpy as np

# draw four reference integers from numpy's generator seeded with 1
np.random.seed(1)
numpy_1=np.random.randint(0,101)
numpy_2=np.random.randint(0,101)
numpy_3=np.random.randint(0,101)
numpy_4=np.random.randint(0,101)

# brute-force search for a random.seed(i) that reproduces the same first four draws
for i in range(20000000):
    random.seed(i)
    random_1=random.randint(0,101)
    if random_1==numpy_1:
        random_2=random.randint(0,101)
        if random_2==numpy_2:
            random_3=random.randint(0,101)
            if random_3==numpy_3:
                random_4=random.randint(0,101)
                if random_4==numpy_4:
                    break

# check whether the next draw from each generator still agrees
print(np.random.randint(0,101))
print(random.randint(0,101))

But this did not really work, as could be expected.

Paolo6
  • `np.sin` is slower than `math.sin` when processing one value, but faster when working with a large array. It's likely the same applies to `random`. `np.random` could be slower for one value, but faster if asked to provide thousands. – hpaulj Jul 26 '19 at 20:52
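(A quick way to check hpaulj's point is to time a single draw against a batch draw. The micro-benchmark below is an added illustration, not part of the original post, and the exact numbers will vary by machine.)

import timeit

# one value at a time: the stdlib generator tends to win
print(timeit.timeit("random.random()", setup="import random", number=100_000))
print(timeit.timeit("np.random.random()", setup="import numpy as np", number=100_000))

# a large batch: numpy's vectorized draw tends to win
print(timeit.timeit("[random.random() for _ in range(10_000)]", setup="import random", number=100))
print(timeit.timeit("np.random.random(10_000)", setup="import numpy as np", number=100))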

3 Answers

12

numpy.random and Python's random work in different ways although, as you say, they use the same algorithm.

In terms of the seed: you can use the set_state and get_state functions from numpy.random (called getstate and setstate in Python's random module) and pass the state from one to the other. The structure is slightly different (in Python's random, the pos integer is appended as the last element of the internal state tuple). See the docs for numpy.random.get_state() and random.getstate():

import random
import numpy as np

random.seed(10)

# grab both states: numpy returns a 5-tuple, python a 3-tuple with a nested state tuple
s1 = list(np.random.get_state())
s2 = list(random.getstate())

# copy the 624 Mersenne Twister words (dropping python's trailing pos element)...
s1[1] = np.array(s2[1][:-1]).astype('int32')
# ...and set numpy's pos field from that trailing element
s1[2] = s2[1][-1]

np.random.set_state(tuple(s1))

print(np.random.random())
print(random.random())
>> 0.5714025946899135
0.5714025946899135
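As a sanity check (my addition, assuming the legacy RandomState float path matches CPython's): each generator now holds its own copy of the same state, so if you keep drawing, the n-th double from np.random should equal the n-th double from random. Integer helpers such as randint use different algorithms in the two libraries and will not match. Continuing the snippet above:

# each generator advances independently from the copied state,
# so paired float draws should keep agreeing
for _ in range(5):
    assert np.random.random() == random.random()
print("float draws stay in sync")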

In terms of efficiency: it depends on what you want to do, but numpy is usually faster because you can create whole arrays of elements without a Python loop:

%timeit np.random.random(10000)
142 µs ± 391 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit [random.random() for i in range(10000)]
1.48 ms ± 2.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In terms of "randomness", numpy is (according to their docs), also better:

Notes: The Python stdlib module "random" also contains a Mersenne Twister pseudo-random number generator with a number of methods that are similar to the ones available in RandomState. RandomState, besides being NumPy-aware, has the advantage that it provides a much larger number of probability distributions to choose from.
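To make the "larger number of probability distributions" point concrete, here are two samplers that numpy.random provides and the stdlib random module has no direct counterpart for (my own illustration, not part of the original answer):

import numpy as np

np.random.seed(0)
# Poisson and Dirichlet samplers exist in numpy.random but not in stdlib random
print(np.random.poisson(lam=3.0, size=5))
print(np.random.dirichlet(alpha=[1.0, 2.0, 3.0]))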

Tarifazo
  • The lack of an explicit loop, called vectorization, very often yields a significant increase in performance. – griffin_cosgrove Jul 26 '19 at 14:38
  • In terms of "randomness", numpy is not better. What you wrote is that it has more probability distributions. It's a different thing, it is richer (not more random). – Yaroslav Nikitenko Jul 22 '20 at 11:17
1

Consider the following dirty hack:

import random
import numpy as np

random.seed(42)
np.random.seed(42)

print(random.random(), np.random.random())

# copy numpy random module state to python random module
a = random.getstate()      # (version, tuple of 624 MT words + position, gauss_next)
b = np.random.get_state()  # ('MT19937', array of 624 MT words, pos, has_gauss, cached_gaussian)

# rebuild python's nested state tuple from numpy's key words and position index
a2 = (a[0], tuple(int(val) for val in list(b[1]) + [int(b[2])]), *a[2:])
random.setstate(a2)

print(random.random(), np.random.random())

Output:

0.6394267984578837 0.3745401188473625  # different
0.9507143064099162 0.9507143064099162  # same

Not sure if this approach is really consistent across all the possibilities of both implementations.
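For readers wondering what those tuples contain, here is a rough sketch of the two state formats (the comments summarize the documented getstate/get_state return values; treat them as a summary, not a spec):

import random
import numpy as np

a = random.getstate()      # (version, tuple of 624 MT words + position, gauss_next)
b = np.random.get_state()  # ('MT19937', array of 624 MT words, pos, has_gauss, cached_gaussian)

print(a[0], len(a[1]))        # 3 625
print(b[0], len(b[1]), b[2])  # MT19937 624 <pos>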

  • Would you mind expanding a bit about this hack? Thank you. What are these tuples, etc. – Yaroslav Nikitenko Jul 22 '20 at 11:44
  • @YaroslavNikitenko The same pseudo-random generator is implemented a little differently in the Python standard library and in numpy, so these tuples represent different formats of the same state. I call it a "hack" because I am not sure this way is always correct and consistent. –  Jul 22 '20 at 12:22
0

Duplicate of this post.

The answer depends on your needs:
- Cryptography / security: secrets
- Scientific research: numpy
- Common use: random
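For illustration (my own sketch, not part of the original answer), the typical entry point for each option:

import secrets
import random
import numpy as np

print(secrets.randbelow(100))        # cryptographically secure integers
print(random.randrange(100))         # general-purpose stdlib generator
print(np.random.randint(0, 100, 5))  # vectorized draws for scientific code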

IQbrod
  • I have read that post earlier, I would not say this is a duplication of that post. I would like to compare the overall performance of both packages in my whole code. It would be a lot of work to compare every function from both packages on performance and even if I do, I should also count how many times every random/numpy.random function is being called. It would be nice if I could just run the whole program twice; once with numpy.random and once with random package, with the same sequence of random numbers so I really know for my program what the difference in performance is. – Paolo6 Jul 26 '19 at 13:53
  • "The reason that I am changing this code is that for research purposes" => "The numpy.random library contains a few extra probability distributions commonly used in scientific research" Your "test" algorithm also has to evaluate on an average results (processor will not schedule two programs the same way, maybe something consuming will be running at the same time etc ...) – IQbrod Jul 26 '19 at 13:57
  • I see what you're getting at. I try to account for processor usage by running both programs serially on a server that is only used by that program. Of course there can always be small deviations, I try to control for as much as possible. But I think a different sequence in random numbers will be a larger problem in this case. Of course I could also run both programs multiple times without any seed and average out the performance speed, but I might end up doing way more runs that way. And since one run takes at least one hour, I was hoping for a smarter solution. – Paolo6 Jul 26 '19 at 14:04