If the relevant probability is always the largest one in the row, use max over only the proba_ columns:
df['prediction'] = np.where(df.filter(like='proba_').max(axis=1) <= 0.9,
                            df['label'],
                            df['prediction'])
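The max shortcut only matches the per-row lookup when the predicted label's column really holds the row maximum. A minimal sketch of checking that first, using the sample data from this answer; the names proba, assumption_holds and result are only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'host': ['A', 'B', 'A'],
    'label': ['label1', 'label2', 'label1'],
    'prediction': ['label1', 'label3', 'label3'],
    'proba_label1': [0.9, 0.03, 0.2],
    'proba_label3': [0.1, 0.95, 0.75],
    'proba_label2': [0, 0.02, 0.05],
})

proba = df.filter(like='proba_')

# Check the assumption: the column with the row maximum is the predicted one.
assumption_holds = bool((proba.idxmax(axis=1) == ('proba_' + df['prediction'])).all())

# Rows whose maximum probability is at most 0.9 fall back to the label.
result = np.where(proba.max(axis=1) <= 0.9, df['label'], df['prediction'])
print(assumption_holds, list(result))
```

If assumption_holds is False for your data, the melt-based selection is the one to use.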
Use melt and select the rows whose column name matches the prediction (instead of lookup), then set the new values with numpy.where:
melt = df.melt(['label','prediction'], ignore_index=False)
df['val'] = melt.loc['proba_' + melt['prediction'] == melt['variable'], 'value']
df['prediction'] = np.where(df['val'] <= 0.9, df['label'], df['prediction'])
print (df)
  host   label prediction  proba_label1  proba_label3  proba_label2   val
0    A  label1     label1          0.90          0.10          0.00   0.9
1    B  label2     label3          0.03          0.95          0.02  0.95
2    A  label1     label1          0.20          0.75          0.05  0.75
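For reference, a self-contained sketch of the melt step with its intermediate pieces spelled out, assuming the sample data from this answer; mask and picked are illustrative names, not part of the original code:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'host': ['A', 'B', 'A'],
    'label': ['label1', 'label2', 'label1'],
    'prediction': ['label1', 'label3', 'label3'],
    'proba_label1': [0.9, 0.03, 0.2],
    'proba_label3': [0.1, 0.95, 0.75],
    'proba_label2': [0, 0.02, 0.05],
})

# Long format: one row per (original row, melted column); the original index is kept.
melt = df.melt(['label', 'prediction'], ignore_index=False)

# True only where the melted column is the predicted label's proba_ column.
mask = ('proba_' + melt['prediction']) == melt['variable']
picked = melt.loc[mask, 'value']

# Column assignment aligns on the index, so each row gets its own probability.
df['val'] = picked
df['prediction'] = np.where(df['val'] <= 0.9, df['label'], df['prediction'])
print(df)
```

Because the helper column is assigned by index, the order of the rows in picked does not matter here.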
Solution without a helper column:
melt = df.melt(['label','prediction'], ignore_index=False)
s = melt.loc['proba_' + melt['prediction'] == melt['variable'], 'value']
#np.where works by position, so align s to the original row order first
s = s.reindex(df.index)
df['prediction'] = np.where(s <= 0.9, df['label'], df['prediction'])
#alternative that aligns the mask on the index instead of by position
#df.loc[s <= 0.9, 'prediction'] = df['label']
print (df)
  host   label prediction  proba_label1  proba_label3  proba_label2
0    A  label1     label1          0.90          0.10          0.00
1    B  label2     label3          0.03          0.95          0.02
2    A  label1     label1          0.20          0.75          0.05
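The selected Series comes back in melted-column order and skips rows whose prediction has no proba_ column, so it may not line up with df positionally. A hedged sketch of realigning it with reindex, assuming a made-up prediction 'label9' that has no matching column (the data below is hypothetical, chosen only to trigger both effects):

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 predicts label3, row 1 predicts a label with no column.
df = pd.DataFrame({
    'host': ['A', 'B', 'A'],
    'label': ['label1', 'label2', 'label1'],
    'prediction': ['label3', 'label9', 'label1'],
    'proba_label1': [0.9, 0.03, 0.2],
    'proba_label3': [0.1, 0.95, 0.75],
    'proba_label2': [0, 0.02, 0.05],
})

melt = df.melt(['label', 'prediction'], ignore_index=False)
s = melt.loc['proba_' + melt['prediction'] == melt['variable'], 'value']

# s is shorter than df and not in row order; reindex restores one value per row,
# with NaN where no proba_ column matched the prediction.
s_aligned = pd.to_numeric(s.reindex(df.index))

# NaN <= 0.9 is False, so unmatched rows keep their prediction.
df['prediction'] = np.where(s_aligned <= 0.9, df['label'], df['prediction'])
print(df)
```

The helper-column version gets the same alignment for free, because assigning a Series to a column matches on the index.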
Performance:
data = {'host': ['A','B','A'],
'label': ['label1', 'label2', 'label1'],
'prediction': ['label1', 'label3', 'label3'],
'proba_label1': [0.9, 0.03, 0.2],
'proba_label3': [0.1, 0.95, 0.75],
'proba_label2': [0, 0.02, 0.05]
}
df = pd.DataFrame(data)
#30000 rows
df = pd.concat([df] * 10000, ignore_index=True)
#deleted answer by @Nk03
In [85]: %timeit df.apply( lambda x: x['label'] if x[f"proba_{x['prediction']}"] <= 0.9 else x['prediction'], 1)
455 ms ± 3.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [86]: %timeit df.apply(fun, axis=1)
482 ms ± 58.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [87]: %%timeit
...: melt = df.melt(['label','prediction'], ignore_index=False)
...: df['val'] = melt.loc['proba_' + melt['prediction'] == melt['variable'], 'value']
...:
...: df['prediction'] = np.where(df['val'] <= 0.9, df['label'], df['prediction'])
...:
72.2 ms ± 4.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
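The timings are only meaningful if the approaches return the same values. A small sketch of that check on the 30000-row frame, assuming the sample data above; by_apply, by_melt and same are illustrative names:

```python
import numpy as np
import pandas as pd

data = {'host': ['A', 'B', 'A'],
        'label': ['label1', 'label2', 'label1'],
        'prediction': ['label1', 'label3', 'label3'],
        'proba_label1': [0.9, 0.03, 0.2],
        'proba_label3': [0.1, 0.95, 0.75],
        'proba_label2': [0, 0.02, 0.05]}
df = pd.concat([pd.DataFrame(data)] * 10000, ignore_index=True)

# Row-wise version: read the predicted label's probability for each row.
by_apply = df.apply(
    lambda x: x['label'] if x[f"proba_{x['prediction']}"] <= 0.9 else x['prediction'],
    axis=1,
)

# Vectorized version: melt, pick the matching value, realign to df, then np.where.
melt = df.melt(['label', 'prediction'], ignore_index=False)
val = melt.loc['proba_' + melt['prediction'] == melt['variable'], 'value']
val = pd.to_numeric(val.reindex(df.index))
by_melt = np.where(val <= 0.9, df['label'], df['prediction'])

# Compare the two results element by element.
same = bool((by_apply.to_numpy() == by_melt).all())
print(same)
```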