Creating fake data in Python

Question

I am trying to create a function that creates fake data to use in a separate analysis. Here are the requirements for the function.

Problem 1

In this problem you will create fake data using numpy. In the cell below, the function create_data takes in 2 parameters, "n" and "rand_gen".

The "rand_gen" parameter is a pseudo-random number generator. We are using a seeded pseudo-random number generator so that the results are reproducible.
Use the numpy.random.randn function of the pseudo-random generator to create a numpy array of length n and return the array.

Here is the function I have created.

import numpy as np

def create_data(n, rand_gen):
    '''
    Creates a numpy array with n samples from the standard normal distribution

    Parameters
    -----------
    n : integer for the number of samples to create
    rand_gen : pseudo-random number generator from numpy

    Returns
    -------
    numpy array from the standard normal distribution of size n
    '''

    numpy_array = np.random.randn(n)
    return numpy_array

Here is the first test I run on my function.

create_data(10, np.random.RandomState(seed=23))

I need the output to be this exact array.

[ 0.66698806,  0.02581308, -0.77761941,  0.94863382,  0.70167179,
 -1.05108156, -0.36754812, -1.13745969, -1.32214752,  1.77225828]

My output is still completely random and I do not fully understand what the RandomState call is trying to do with the seed to create the above array rather than have it be completely random. I know I need to use the rand_gen variable in my function, but I do not know where and I think it's because I just don't understand what it is trying to do.

Related? https://stackoverflow.com/questions/22994423/difference-between-np-random-seed-and-np-random-randomstate — DavidG, Oct 18 '18 at 18:23
You're not using `rand_gen` at all in your function? It looks like you create a seeded generator and then just default back to the standard module RNG — roganjosh, Oct 18 '18 at 18:23
I don't know `numpy`, but I'm able to reproduce the required results. Take a look at `numpy.random.seed`. You need to set the seed before you go and get your array. — UtahJarhead, Oct 18 '18 at 18:25
Correct, I am not sure how to use it. I tried using it where I am currently using n, but that gave me an error and then I wasn't using n, so I took it out because I could still get an array, just not the one I need. — Thomas, Oct 18 '18 at 18:26
Just call `rand_gen.randn(n)`. [docs](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.random.RandomState.randn.html#numpy.random.RandomState.randn) — sascha, Oct 18 '18 at 18:29

score 1 · Accepted Answer · answered Oct 18 '18 at 18:33

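Define numpy_array = rand_gen.randn(n) so that the samples are drawn from the seeded generator passed into the function, rather than from the global np.random module.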

answered by Thomas
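
For context, here is a minimal sketch of the question's function with that one-line change applied. The variable names first and second are only for illustration; the check shows that two generators built with the same seed yield the same array, which is the behaviour the test in the question relies on.

```python
import numpy as np

def create_data(n, rand_gen):
    """Return n samples from the standard normal distribution, drawn from rand_gen."""
    # Draw from the generator that was passed in, not from the global np.random module
    numpy_array = rand_gen.randn(n)
    return numpy_array

# Two generators seeded identically should produce identical arrays
first = create_data(10, np.random.RandomState(seed=23))
second = create_data(10, np.random.RandomState(seed=23))

print(np.array_equal(first, second))
print(first)
```

With seed=23, the printed array is the one to compare against the required output in the question.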

score 1 · Answer 2 · answered Oct 18 '18 at 19:17

I think the question you are asking is about pseudo-random numbers and reproducible random sequences.

True random numbers are derived from unpredictable real-world data, like watching lava lamps, while a pseudo-random number generator computes a long sequence of numbers that only appears random.

The basic algorithm is:

1. Get a seed, or a big number, maybe from the current clock time.
2. Take part of the seed as the random number.
3. Do unspeakable mathematical mutilations to the seed involving bit-shifts, exponents, and multiplications.
4. Use the output of these calculations as the new seed, then go to step 2.
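
The steps above can be sketched with a toy generator. This is only an illustration: ToyGenerator, its constants, and the sequence helper are made up for this example, and it is a simple linear congruential generator, not the Mersenne Twister algorithm that numpy's RandomState uses.

```python
class ToyGenerator:
    """Toy linear congruential generator, for illustration only."""

    # Illustrative constants; any real generator chooses these carefully
    MODULUS = 2**32
    MULTIPLIER = 1664525
    INCREMENT = 1013904223

    def __init__(self, seed):
        # Step 1: start from a seed
        self.state = seed % self.MODULUS

    def next_value(self):
        # Step 2: take part of the current state (its high bits) as the output
        value = self.state >> 16
        # Steps 3 and 4: scramble the state and keep the result as the new seed
        self.state = (self.MULTIPLIER * self.state + self.INCREMENT) % self.MODULUS
        return value


def sequence(seed, count):
    """Return the first `count` outputs of a ToyGenerator started from `seed`."""
    gen = ToyGenerator(seed)
    return [gen.next_value() for _ in range(count)]


same_a = sequence(23, 5)
same_b = sequence(23, 5)
different = sequence(24, 5)

print(same_a == same_b)     # True: same seed, same sequence
print(same_a == different)  # False: a different seed gives a different sequence
```

Nothing in the generator is random once the seed is fixed, which is why the same seed always reproduces the same numbers.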

The trick is that specifying the same seed means you get the same sequence every time. You can set the seed of numpy's global generator with numpy.random.seed(), or, as in your test, create a seeded numpy.random.RandomState object and draw the numbers from it.
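
As a small sketch of both ways of seeding in numpy (the variable names are just for illustration): re-seeding the global generator replays the same numbers, and a RandomState object created with the same seed yields that same sequence without touching the global state.

```python
import numpy as np

# Seed the global generator, then draw three numbers
np.random.seed(23)
global_draws = np.random.randn(3)

# Re-seeding with the same value replays the same sequence
np.random.seed(23)
global_again = np.random.randn(3)

# A separate generator object with the same seed gives the same numbers
rand_gen = np.random.RandomState(seed=23)
local_draws = rand_gen.randn(3)

print(np.array_equal(global_draws, global_again))
print(np.array_equal(global_draws, local_draws))
```

Passing a RandomState object around, as the assignment does, keeps the reproducibility local to the function instead of relying on global state.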

I hope this is the question you were asking.
