Create distribution in Pandas

Question

I want to generate a random/simulated data set with a specific distribution.

As an example, the distribution has the following properties:

A population of 1000
The Gender mix is: male 49%, female 50%, other 1%
The age has the following distribution: 0-30 (30%), 31-60 (40%), 61-100 (30%)

The resulting data frame would have 1000 rows, and two columns called gender and age (with the above value distributions)

Is there a way to do this in Pandas or another library?

Do you want exactly those % mixes? Or do you want to create a sample with those probabilities? For age, what does 61+ mean (what is the upper cap? 100? 120?). Is age uniformly distributed within the age brackets? Or is age just an indicator of category and not an actual number? — noah, Sep 24 '20 at 17:38

Sergey Bushmanov · Answer 1 · 2020-09-24T18:00:01.433


You may try:

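import numpy as np
import pandas as pd

N = 1000
# Gender is sampled with the given probabilities, so the mix is approximate.
gender = np.random.choice(["male", "female", "other"], size=N, p=[.49, .5, .01])

# Age brackets get exact counts; ages are uniform within each bracket.
# range() excludes its upper bound, hence 31, 61 and 101.
n_young, n_mid = int(.3 * N), int(.4 * N)
age = np.r_[np.random.choice(range(0, 31), size=n_young),
            np.random.choice(range(31, 61), size=n_mid),
            np.random.choice(range(61, 101), size=N - n_young - n_mid)]
np.random.shuffle(age)

df = pd.DataFrame({"gender": gender, "age": age})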
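
The snippet above samples gender with probabilities, so the 49/50/1 mix holds only approximately. If the exact percentages are wanted instead, one possible sketch is to build the labels with fixed counts and shuffle them. The name `df_exact`, the seed, and the assumption that ages are uniform integers within each bracket are illustrative choices, not something the question specifies.

```python
import numpy as np
import pandas as pd

N = 1000
rng = np.random.default_rng(0)  # seed chosen arbitrarily, for reproducibility

# Exact counts per gender: 49%, 50% and 1% of N = 1000.
counts = {"male": 490, "female": 500, "other": 10}
gender = np.repeat(list(counts.keys()), list(counts.values()))
gender = rng.permutation(gender)  # shuffled copy, counts unchanged

# Exact bracket sizes; uniform integer ages within each bracket (an assumption).
# rng.integers excludes the upper bound by default.
age = np.concatenate([
    rng.integers(0, 31, size=300),    # 0-30
    rng.integers(31, 61, size=400),   # 31-60
    rng.integers(61, 101, size=300),  # 61-100
])
rng.shuffle(age)  # in-place shuffle so brackets are not grouped by row

df_exact = pd.DataFrame({"gender": gender, "age": age})
```

`df_exact["gender"].value_counts()` can be used to confirm the counts.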

edited Sep 24 '20 at 18:00

answered Sep 24 '20 at 17:38

Sergey Bushmanov


Using `np.random.randint(31, 61, size=...)` might be faster. – Quang Hoang Sep 24 '20 at 17:41
My read of the OP was that the ages would be presented as brackets and could be generated in the same fashion as the genders. – Paul H Sep 24 '20 at 17:50
@PaulH 61+ implies a distribution with a long right tail – Sergey Bushmanov Sep 24 '20 at 17:51
Totally, but what I'm saying is that the array should be selected from three categories, not continuous distributions. The question is ambiguous. I'm just sharing my interpretation. – Paul H Sep 24 '20 at 17:52
@PaulH totally agree with you – Sergey Bushmanov Sep 24 '20 at 17:53
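
Under the bracket interpretation discussed in these comments, age would be a category label rather than a number, and it can be drawn the same way as gender. The following is only a sketch of that reading: the bracket labels, the name `df_brackets`, and the seed are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

N = 1000
rng = np.random.default_rng(0)  # seed chosen arbitrarily, for reproducibility

brackets = ["0-30", "31-60", "61-100"]  # labels assumed from the question

# Both columns are sampled with probabilities, so the mixes are approximate.
gender = rng.choice(["male", "female", "other"], size=N, p=[0.49, 0.50, 0.01])
age_bracket = rng.choice(brackets, size=N, p=[0.3, 0.4, 0.3])

df_brackets = pd.DataFrame({
    "gender": gender,
    # An ordered categorical keeps the brackets sortable in age order.
    "age": pd.Categorical(age_bracket, categories=brackets, ordered=True),
})
```

`df_brackets["age"].value_counts(normalize=True)` shows how close a given sample comes to the 30/40/30 split.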
