0

I have a large dataframe where I would like to change values of certain rows based on values of other columns. The problem with my for loop is that it takes very long for large datasets.

The dataframe has colums similar to

label prediction proba_label1 proba_label2 proba_label3
label1 label2 0.3 0.6 0.1

In this case since pred_label2 is < 0.9, the value of colum "prediction" should change to "label1"

for i, row in df.iterrows():
    pred_label = row['prediction']
    proba_label = 'proba_' + pred_label
    probability = row[proba_label]
    if probability <= 0.9:
        df.at[i, 'prediction'] = row['label']

Example DF

data = {'host': ['A','B','A'],
       'label': ['label1', 'label2', 'label1'],
       'prediction': ['label1', 'label3', 'label3'],
       'proba_label1': [0.9, 0.03, 0.2],
       'proba_label3': [0.1, 0.95, 0.75],
       'proba_label2': [0, 0.02, 0.05]
       }

df = pd.DataFrame(data)
Chris
  • 3,581
  • 8
  • 30
  • 51
  • FYI [Pandas itertuples() Is Faster Than iterrows() and How To Make It Even Faster](https://medium.com/swlh/why-pandas-itertuples-is-faster-than-iterrows-and-how-to-make-it-even-faster-bc50c0edd30d) – DarrylG May 14 '21 at 09:57

2 Answers2

3

From the example data and the probable context (a machine learning model with a softmax function for classification), it is clear that the initial prediction is always the label with the highest probability.

You can exploit that fact to avoid any loops or lookups:

proba_max = np.max([df.proba_label1, df.proba_label2, df.proba_label3], axis=0)
df['prediction'] = np.where(proba_max <= 0.9, df['label'], df['prediction'])
Arne
  • 9,990
  • 2
  • 18
  • 28
  • I like the idea since it is true that the prediction is always the highest probability of the proba_* columns. – Chris May 14 '21 at 10:53
  • 2
    Thanks for clarifying Chris. @jezrael, the context is probably a machine learning model with a softmax function for classification. – Arne May 14 '21 at 10:55
  • Correct. I should have maybe stated that fact. sorry for the confusion. – Chris May 14 '21 at 10:55
  • @Chris - hmmm, OK, no idea about context. – jezrael May 14 '21 at 10:58
2

If relevant probability is always the largest one use max with only proba_ columns:

df['prediction'] = np.where(df.filter(like='proba_').max(axis=1) <= 0.9, 
                            df['label'], 
                            df['prediction'])

Use melting with select by columns names (instead lookup) and then set new values in numpy.where:

melt = df.melt(['label','prediction'], ignore_index=False)
df['val'] = melt.loc['proba_' + melt['prediction'] == melt['variable'], 'value']

df['prediction'] = np.where(df['val'] <= 0.9, df['label'], df['prediction'])
print (df)
  host   label prediction  proba_label1  proba_label3  proba_label2   val
0    A  label1     label1          0.90          0.10          0.00   0.9
1    B  label2     label3          0.03          0.95          0.02  0.95
2    A  label1     label1          0.20          0.75          0.05  0.75

Solution without helper column:

melt = df.melt(['label','prediction'], ignore_index=False)
s = melt.loc['proba_' + melt['prediction'] == melt['variable'], 'value']

df['prediction'] = np.where(s <= 0.9, df['label'], df['prediction'])

#if some labels not match this is safer like np.where
#df.loc[s <= 0.9, 'prediction'] = df['label']
print (df)
  host   label prediction  proba_label1  proba_label3  proba_label2
0    A  label1     label1          0.90          0.10          0.00
1    B  label2     label3          0.03          0.95          0.02
2    A  label1     label1          0.20          0.75          0.05

Performance:

data = {'host': ['A','B','A'],
       'label': ['label1', 'label2', 'label1'],
       'prediction': ['label1', 'label3', 'label3'],
       'proba_label1': [0.9, 0.03, 0.2],
       'proba_label3': [0.1, 0.95, 0.75],
       'proba_label2': [0, 0.02, 0.05]
       }

df = pd.DataFrame(data)
#[30000 rows
df = pd.concat([df] * 10000, ignore_index=True)

#deleted answer by @Nk03
In [85]: %timeit df.apply( lambda x: x['label'] if x[f"proba_{x['prediction']}"] <= 0.9 else x['prediction'], 1)
455 ms ± 3.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [86]: %timeit df.apply(fun, axis=1)
482 ms ± 58.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [87]: %%timeit
    ...: melt = df.melt(['label','prediction'], ignore_index=False)
    ...: df['val'] = melt.loc['proba_' + melt['prediction'] == melt['variable'], 'value']
    ...: 
    ...: df['prediction'] = np.where(df['val'] <= 0.9, df['label'], df['prediction'])
    ...: 
72.2 ms ± 4.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • I have added an example dataframe in my question. With your code row 2 is flipped but I don't expect it to flip. I only expect row 3 to change prediction to label1 – Chris May 14 '21 at 10:09
  • 1
    You are right. maybe did not refresh properly – Chris May 14 '21 at 10:18