
I want to generate a random/simulated data set with a specific distribution.

As an example, the distribution has the following properties:

  1. A population of 1000
  2. The gender mix is: male 49%, female 50%, other 1%
  3. The age has the following distribution: 0-30 (30%), 31-60 (40%), 61-100 (30%)

The resulting data frame would have 1000 rows and two columns, gender and age, with the value distributions above.

Is there a way to do this in Pandas or another library?

Mustard Tiger
  • `numpy.random.choice` – Paul H Sep 24 '20 at 17:29
  • Do you want exactly those % mixes? Or do you want to create a sample with those probabilities? For age, what does 61+ mean (what is the upper cap? 100? 120?). Is age uniformly distributed within the age brackets? Or is age just an indicator of category and not an actual number? – noah Sep 24 '20 at 17:38
  • I edited the upward bound for age – Mustard Tiger Sep 24 '20 at 18:14

1 Answer


You may try:

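import numpy as np
import pandas as pd

N = 1000
# Gender is sampled with these probabilities, so the realized mix is only approximately 49/50/1.
gender = np.random.choice(["male", "female", "other"], size=N, p=[.49, .5, .01])

# Age uses exact bracket sizes (30% / 40% / 30%), drawn uniformly within each bracket.
# range() excludes its stop value, so the stops are 31, 61 and 101.
n_young, n_mid = int(.3 * N), int(.4 * N)
age = np.r_[np.random.choice(range(0, 31), size=n_young),
            np.random.choice(range(31, 61), size=n_mid),
            np.random.choice(range(61, 101), size=N - n_young - n_mid)]
np.random.shuffle(age)  # mix the brackets so rows are not ordered by age

df = pd.DataFrame({"gender": gender, "age": age})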
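
If the gender mix should also be exact rather than sampled with those probabilities (the distinction raised in the comments), one possible variant is to build each column from fixed counts and shuffle it. This is only a sketch: the seed, the hard-coded counts for N = 1000, the uniform draw within each age bracket, and the names `gender_counts`, `age_brackets`, `gender_share` and `age_share` are assumptions made for illustration, not part of the question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for reproducibility
N = 1000

# Exact gender counts for N = 1000 (49% / 50% / 1%), repeated and then shuffled.
gender_counts = {"male": 490, "female": 500, "other": 10}
assert sum(gender_counts.values()) == N
gender = np.repeat(list(gender_counts.keys()), list(gender_counts.values()))
rng.shuffle(gender)

# Exact bracket counts; ages are assumed uniform within each inclusive bracket.
age_brackets = [((0, 30), 300), ((31, 60), 400), ((61, 100), 300)]
age = np.concatenate([
    rng.integers(low, high, size=n, endpoint=True)  # endpoint=True includes `high`
    for (low, high), n in age_brackets
])
rng.shuffle(age)

df = pd.DataFrame({"gender": gender, "age": age})

# Check the realized shares against the targets.
gender_share = df["gender"].value_counts(normalize=True)
age_share = pd.cut(
    df["age"], bins=[-1, 30, 60, 100], labels=["0-30", "31-60", "61-100"]
).value_counts(normalize=True)
```

Because the two columns are shuffled independently, gender and age are unrelated to each other in the result; the shares in `gender_share` and `age_share` should match the targets exactly.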
Sergey Bushmanov