
I have a variable x containing 2700 points; this is my original data.

The histogram of my data looks like this. The cyan line is the distribution my data follows. I used curve_fit on my histogram and obtained the fitted curve, which is a NumPy array of 100000 points.

[Image: histogram of the data with the fitted distribution curve in cyan]

I want to generate smoothed random data, of say 100000 points, that follows the distribution of my original data, i.e. in principle I want 100000 points below the fitted curve, starting from 0.0 and increasing in the same way as the curve up to 0.5.

What I have tried so far to get 100000 points below the curve is:

I generated uniform random numbers using np.random.uniform(0, 0.5, 100000):

    random_x = []
    u = np.random.uniform(0, 0.5, 100000)

    for i in u:
        if i <= y_ran:  # here y_ran is the NumPy array of the fitted curve
            random_x.append(i)

But I get the error:

    ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

I know the above code is not the proper approach, but how should I proceed further?
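The error occurs because `i <= y_ran` compares one scalar against the entire fitted-curve array, producing a boolean array rather than a single True/False. A hedged sketch of rejection sampling that does keep only points under the curve (using a stand-in curve here, since the real x-grid and y_ran aren't shown in the question):

```python
import numpy as np

# Stand-in for the fitted curve and its x-grid; substitute your own
# x values and y_ran array here.
x_grid = np.linspace(0.0, 0.5, 100000)
y_ran = np.sin(np.pi * x_grid / 0.5)

# Rejection sampling: draw candidate x positions uniformly, then accept
# each candidate with probability proportional to the curve height there.
rng = np.random.default_rng(0)
candidates = rng.uniform(0.0, 0.5, 100000)
heights = rng.uniform(0.0, y_ran.max(), 100000)

# Look up the curve value at each candidate x via interpolation,
# then keep only candidates whose random height falls under the curve.
curve_at_candidates = np.interp(candidates, x_grid, y_ran)
random_x = candidates[heights <= curve_at_candidates]
```

The accepted `random_x` values are then distributed according to the curve's shape; note that some candidates are discarded, so fewer than 100000 points survive (you can draw more candidates to compensate).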

Srivatsan

2 Answers


Okay, so y_ran is an array of values that defines your curve. If I understand correctly, you want a random dataset that falls underneath your curve. One approach is to start with your curve points and decrease each of them by some amount; for example, you could make each new point equal to somewhere between 80% and 100% of the original.

variation = np.random.uniform(low=.8, high=1.0, size=len(y_ran))
newData = y_ran * variation

Does that give you someplace to start?
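A minimal runnable version of this scaling idea, with a stand-in array for y_ran since the original fitted curve isn't shown:

```python
import numpy as np

# Stand-in for the fitted curve; substitute your own y_ran array.
y_ran = np.linspace(0.0, 0.5, 1000)

# Scale every curve point down by a random factor in [0.8, 1.0),
# so each new point lies at or below the curve.
variation = np.random.uniform(low=0.8, high=1.0, size=len(y_ran))
newData = y_ran * variation
```

Each element of newData is, by construction, no larger than the corresponding element of y_ran.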

  • The problem is, I don't need uniform randoms. I need randoms that follow the above distribution of my data. i.e points below the cyan line, the fit – Srivatsan Jun 10 '14 at 11:59
  • the above does not eliminate the points above the curve. – Srivatsan Jun 10 '14 at 12:19
  • I misunderstood your question. I would use numpy.random.normal to generate such a distribution and a similar approach to the one I gave to ensure the data doesn't exceed your curve. Make sure you have the same number of generated points as you do points in your curve, otherwise the comparison doesn't make sense – stackspace Jun 10 '14 at 12:45
  • you mean np.random.normal(mean,stddev,100000) and then use your above conditions? – Srivatsan Jun 10 '14 at 12:52
  • the above conditions you gave are appending all the points from the sample u. They don't eliminate the points above y_ran – Srivatsan Jun 10 '14 at 12:55
  • Instead of approaching it like this, you could generate a random array equal in length to y_ran, whose values could contain anything between .8(ish) and 1.0. Then, your new generated data can be equal to the elementwise multiplication of the two arrays. Make sense? EDIT: "the two arrays" being the newly generated decimal array and y_ran – stackspace Jun 10 '14 at 13:04
  • could you please edit your answer with the above comment to make it more clear? P.S why 0.8 to 1.0 instead of 0.0 to 0.5 – Srivatsan Jun 10 '14 at 13:18
  • sure thing, working on it now. – stackspace Jun 10 '14 at 13:24

I would approach the problem in the following way: first, fit your y_ran fitted curve to a Gaussian (see for instance this question), and then draw your sample from a normal distribution with the fitted coefficients using the np.random.normal function. Something along these lines should work (in part taken from the answer to the question I'm referring to):

import numpy
from scipy.optimize import curve_fit    

# Define model function to be used to fit to the data above:
def gauss(x, *p):
    A, mu, sigma = p
    return A*numpy.exp(-(x-mu)**2/(2.*sigma**2))

# p0 is the initial guess for the fitting coefficients (A, mu and sigma above)
p0 = [1., 0., 1.]

coeff, var_matrix = curve_fit(gauss, x, y_ran, p0=p0)

# mu and sigma (not the amplitude A) parameterize the sampling distribution
sample = numpy.random.normal(loc=coeff[1], scale=coeff[2], size=100000)

Note: (1) this is not tested; (2) you'll need the x values for your fitted curve.
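As a rough self-contained check of the idea above, with synthetic data standing in for the question's x and y_ran (which aren't shown), and taking the absolute value of the fitted sigma since its sign is arbitrary from the fit:

```python
import numpy as np
from scipy.optimize import curve_fit

def gauss(x, A, mu, sigma):
    return A * np.exp(-(x - mu)**2 / (2. * sigma**2))

# Synthetic stand-ins for the question's x grid and fitted curve y_ran.
x = np.linspace(-3, 3, 200)
y_ran = gauss(x, 1.0, 0.2, 0.7)

# Fit the curve to recover (A, mu, sigma).
coeff, var_matrix = curve_fit(gauss, x, y_ran, p0=[1., 0., 1.])

# Sample from the fitted normal distribution; abs() guards against
# the fit returning a negative sigma.
sample = np.random.normal(loc=coeff[1], scale=abs(coeff[2]), size=100000)
```

On this noiseless synthetic curve the fit recovers mu ≈ 0.2 and sigma ≈ 0.7, and the sample mean lands close to mu.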

Andrey Sobolev