Probabilistically mutate dataframe

Question

Given

df = pd.DataFrame({'feature': [0.5,0.1,0.3,0.2,0.6,0.4,0.3], 'label': [0,1,2,2,1,2,0]})

I would like to apply the following rule: for all rows with feature greater than 0.2, there is a 60% chance that their label changes to 2. Otherwise it will remain unchanged.

My solution was:

df.loc[df.feature > 0.2, 'label'] = [
    np.random.choice(x, p=(0.6,0.4)) for x in zip(np.full((df.feature > 0.2).sum(), fill_value=2), df.loc[df.feature > 0.2, 'label'])]

Is there a simpler, vectorised way to do this?

score 2 · Accepted Answer · answered Oct 19 '21 at 08:54

The idea is to build a random boolean mask with the required probabilities (as in this solution) and apply it only to the rows where feature is greater than 0.2:

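N = 2                     # label to assign
m = df.feature > 0.2      # rows the rule applies to
# one draw per selected row: True with probability 0.6
mask = np.random.choice([True, False], m.sum(), p=[0.6, 0.4])

# where mask is True set N, otherwise keep the existing label
df.loc[m, 'label'] = np.where(mask, N, df.loc[m, 'label'])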
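For reference, the same rule can also be written as a single boolean assignment. This is only a sketch of an alternative, not part of the answer above: it assumes NumPy's Generator API (np.random.default_rng), and the names rng, eligible and flip as well as the seed value are illustrative choices. It draws one uniform number per row and combines the result with the feature condition, so rows that fail either test keep their label.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'feature': [0.5, 0.1, 0.3, 0.2, 0.6, 0.4, 0.3],
                   'label': [0, 1, 2, 2, 1, 2, 0]})

# Seed chosen arbitrarily so the run is reproducible (assumption, not from the post)
rng = np.random.default_rng(seed=0)

eligible = df['feature'] > 0.2        # rows the rule applies to
flip = rng.random(len(df)) < 0.6      # independent draw per row, True with probability 0.6

# Overwrite the label only where both conditions hold; all other rows are untouched
df.loc[eligible & flip, 'label'] = 2
```

Drawing for every row and discarding the draws for ineligible rows costs a few unused random numbers, but it avoids having to align a shorter mask with the selected subset.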

answered Oct 19 '21 at 08:54

jezrael
