1

I have a set of existing data, let's say:

sample_data = [2,2,2,2,2,2,3,3,3,3,4,4,4,4,4]

From this sample data, I would like to generate a random set of data of a given length. The new values should not be drawn directly from the sample data, but from a distribution fitted to the sample data.

Expected output if I wanted 5 random points:

output_data = [3.4,2.3,1.5,5.2,1.3]

3 Answers

2

Use random.sample:

import random

sample_data = [2,2,2,2,2,2,3,3,3,3,4,4,4,4,4]
# if you want to select 5 samples from above data
print(random.sample(sample_data, 5))

Output:

[3, 2, 2, 4, 2]
Sociopath
  • Hey - I don't want to select x samples from the data, but rather generate new data based on the existing data. – Brian Chen Feb 01 '19 at 17:47
  • What's the difference between your earlier and later sentences? Maybe you need to edit the question and elaborate further. – Sociopath Feb 01 '19 at 17:51
  • To clarify - I would like to fit a distribution to a data set, and then create a random set of data based on that distribution. – Brian Chen Feb 01 '19 at 17:54
  • @BrianChen This is not what was asked in the question, please edit. – Rocky Li Feb 01 '19 at 17:57
1
import numpy as np
length = 3
sample_data = [2,2,2,2,2,2,3,3,3,3,4,4,4,4,4]

np.random.choice(sample_data, length, False)  # third argument is replace=False: sampling without replacement
Out[287]: array([4, 4, 2])
Onyambu
  • Hey - I don't want to select x samples from the data, but rather generate new data based on the existing data. – Brian Chen Feb 01 '19 at 17:48
  • @BrianChen just remove `False` from the above code and run it with `length` set to 30, for example. – Onyambu Feb 01 '19 at 17:56
  • It is still just outputting values from the data set, not generating new data points from a distribution. – Brian Chen Feb 01 '19 at 18:27
  • What do you mean by generating new data points based on a distribution? Can you elaborate? – Onyambu Feb 01 '19 at 18:57
  • Hey, thanks for replying - I would like Python to determine which kind of distribution the data fits best (as well as its parameters) and use that to create x random data points from the fitted distribution. For example, if my data set best fits a normal distribution of (10,1), then use this normal distribution of (10,1) to generate 15 new data points. – Brian Chen Feb 05 '19 at 15:14
  • @BrianChen your question has already been answered [here](https://stackoverflow.com/questions/6620471/fitting-empirical-distribution-to-theoretical-ones-with-scipy-python) – Onyambu Feb 05 '19 at 17:10
1

There's an important premise of the question that needs to be decided: what kind of distribution do you want? As humans, we can probably classify a distribution by its shape when we have enough data. A machine cannot, so assigning a distribution type, say uniform or binomial, to a new input is arbitrary. Here I'll give a brief answer using the default workhorse of statistics, the normal distribution (by the Central Limit Theorem, the means of sufficiently large samples are approximately normally distributed).

import numpy as np

sample_data = [2,2,2,2,2,2,3,3,3,3,4,4,4,4,4]
size = 5
new_samples = np.random.normal(np.mean(sample_data), np.std(sample_data), size)

>>> new_samples
array([ 2.01221231,  2.62772975,  1.79965428,  3.83601719,  2.44967777])

The new samples are drawn from a normal distribution that takes its mean and standard deviation from the original samples.
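As a minimal sketch of the same idea, the snippet below uses NumPy's `default_rng` generator with a fixed seed so the draw is repeatable. The seed value and the use of `ddof=1` (sample rather than population standard deviation) are assumptions made for illustration, not part of the code above.

```python
import numpy as np

sample_data = [2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4]
size = 5

# Estimate the parameters of the normal distribution from the data.
mu = np.mean(sample_data)
# ddof=1 gives the sample standard deviation (an assumption here;
# np.std defaults to ddof=0, the population standard deviation).
sigma = np.std(sample_data, ddof=1)

# A seeded generator makes the random draw repeatable (the seed is arbitrary).
rng = np.random.default_rng(seed=0)
new_samples = rng.normal(loc=mu, scale=sigma, size=size)
print(new_samples)
```

With only 15 points the two standard deviation estimates differ slightly; which one is appropriate depends on whether the data is treated as a sample or as the whole population.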

Rocky Li
  • Hey, thanks for replying - I would like Python to determine which kind of distribution the data fits best (as well as its parameters) and use that to create x random data points from the fitted distribution. For example, if my data set best fits a normal distribution of (10,1), then use this normal distribution of (10,1) to generate 15 new data points. – Brian Chen Feb 05 '19 at 15:13
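One possible sketch of what this comment describes, assuming SciPy is available: fit a few candidate distributions with `scipy.stats`, keep the one with the smallest Kolmogorov-Smirnov statistic, and draw new points from it. The candidate list, the helper name `fit_best`, the KS statistic as the selection criterion, and the seed are all illustrative assumptions, not something stated in the thread.

```python
import numpy as np
from scipy import stats

sample_data = [2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4]
data = np.asarray(sample_data, dtype=float)

# Candidate continuous distributions to try (an arbitrary, illustrative list).
candidates = {
    "norm": stats.norm,
    "uniform": stats.uniform,
    "expon": stats.expon,
}


def fit_best(data, candidates):
    """Fit each candidate by maximum likelihood and return the
    (name, params, ks_statistic) with the smallest KS statistic.
    Hypothetical helper written for this example."""
    best = None
    for name, dist in candidates.items():
        params = dist.fit(data)  # MLE estimate, e.g. (loc, scale)
        ks_stat = stats.kstest(data, dist.cdf, args=params).statistic
        if best is None or ks_stat < best[2]:
            best = (name, params, ks_stat)
    return best


best_name, best_params, best_ks = fit_best(data, candidates)

# Draw 15 new points from the selected distribution (the seed is arbitrary).
rng = np.random.default_rng(0)
new_points = candidates[best_name].rvs(*best_params, size=15, random_state=rng)

print(best_name, best_params)
print(new_points)
```

The KS statistic is used here only as a rough ranking: the parameters are estimated from the same data, and the sample contains many tied integer values, so it should not be read as a formal goodness-of-fit test.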