-1

I am trying to create a dataframe with 4 columns: 'date', 'age', 'conversion', and 'marital_status', where marital status is one of 4 choices (married, divorced, single, unknown). I can create the dataframe with the code below, but every status comes out equally likely and I am not sure how to specify the frequency of each one. I want married to be 50%, divorced 30%, single 15%, and the rest (5%) unknown. How do I do this?

import pandas as pd
import numpy as np
import random

random.seed(30)
np.random.seed(30)

start_date,end_date = '1/1/2015','12/31/2019'
date_rng = pd.date_range(start= start_date, end=end_date, freq='D')
length_of_field = date_rng.shape[0]
df = pd.DataFrame(date_rng, columns=['date'])
df['age'] = np.random.randint(18,100,size=(len(date_rng)))
df['conversion'] = np.random.randint(0,2,size=(len(date_rng)))
marital_status = ('divorced','married','single','unknown')
group_1 = [random.choice(marital_status) for _ in range(length_of_field)]
df['marital_status'] = group_1
print('\ndf:')
print(df)


Alhpa Delta

3 Answers

1

You can use numpy.random.choice. The p parameter specifies the probability of each option, in the same order as the options are listed, and the probabilities must sum to 1.

import numpy as np
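# marital_status = ('divorced', 'married', 'single', 'unknown'), as defined in the question
# length_of_field is already an integer, so it is passed directly as the size
np.random.choice(marital_status, length_of_field, p=[0.3, 0.5, 0.15, 0.05])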
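To check that the result matches the requested frequencies, the call above can be dropped into a cut-down version of the question's setup. This is only a sketch: the seed and date range are taken from the question, and the names `statuses`, `probs`, and `shares` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Same seed as in the question, so the sketch is reproducible.
np.random.seed(30)

date_rng = pd.date_range(start='1/1/2015', end='12/31/2019', freq='D')
df = pd.DataFrame({'date': date_rng})

# Options and probabilities are matched by position; p must sum to 1.
statuses = ['married', 'divorced', 'single', 'unknown']
probs = [0.50, 0.30, 0.15, 0.05]
df['marital_status'] = np.random.choice(statuses, size=len(df), p=probs)

# Observed shares should be close to, but not exactly, the requested ones,
# because each row is drawn independently at random.
shares = df['marital_status'].value_counts(normalize=True)
print(shares)
```

Since the draws are random, the observed shares only approximate 50/30/15/5; the gap shrinks as the number of rows grows.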
cmxu
1

Try:

# Each probability in p applies to the option at the same position
np.random.choice(['married', 'divorced', 'single', 'unknown'], size=len(date_rng), p=[0.5, 0.3, 0.15, 0.05])
0

You can use random.choices (inspired by this question):

marital_status = random.choices(
    population=['divorced','married','single','unknown'],
    weights=[0.3, 0.5, 0.15, 0.05],
    k=df.shape[0]
)
df['marital_status'] = marital_status
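As a quick sanity check, here is a small standalone sketch of the same approach. It assumes Python 3.6 or later (where random.choices is available); the sample size `n` and the integer weights are illustrative choices, not part of the question. The weights passed to random.choices are relative, so they do not have to sum to 1.

```python
import random
from collections import Counter

random.seed(30)  # same seed as in the question, for reproducibility

population = ['married', 'divorced', 'single', 'unknown']
# Relative weights: equivalent to 0.5, 0.3, 0.15, 0.05.
weights = [50, 30, 15, 5]

n = 10_000  # hypothetical sample size, chosen only for illustration
sample = random.choices(population, weights=weights, k=n)

# Count how often each status was drawn and print its observed share.
counts = Counter(sample)
for status in population:
    print(status, counts[status] / n)
```

The printed shares should land near 0.5, 0.3, 0.15, and 0.05, with some random variation.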
XavierBrt