Boolean Operation on Pandas Dataframe Column Average - this has got to be simple

Question

I have at pandas dataframe:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A':[1,5,3],
                  'B': [4,2,6]})

df['avg'] = df.mean(axis=1)
df[df<df['avg']]

I would like keep all the values in the dataframe that are below the average value in column df['avg']. When I perform the below operation I am returned all NAN's

df[df<df['avg']]

If I set up a for loop I can get the boolean of what I want.

col_names = ['A', 'B']
for colname in col_names:
    df[colname] = df[colname]<df['avg']

What I am searching for would look like this:

df_desired = pd.DataFrame({
    'A':[1,np.nan,3],
    'B':[np.nan,2,np.nan],
    'avg' :[2.5, 3.5, 4.5]
})

How do I do this? There has to be a pythonic way to do this.

score 1 · Accepted Answer · answered Jan 26 '20 at 20:28

1

You can use .mask(..) [pandas-doc] here. We can use numpy's broadcasting to generate an array of booleans that are higher than the given average:

>>> df.mask(df.values > df['avg'].values[:,None])
     A    B  avg
0  1.0  NaN  2.5
1  NaN  2.0  3.5
2  3.0  NaN  4.5

answered Jan 26 '20 at 20:28

Willem Van Onsem

443,496
30
428
555

Thank you. I still have some work in understanding what you did but I'm ecstatic you gave me what I needed. Thank you. By adding ".values" to the dataframe ( df.values) you created a numpy array. By adding it to df['avg].values you created a second numpy array. What did you do by adding the .values[:,None]? I notice that the None corresponds to a column. – getaglow Jan 26 '20 at 20:51
Note that using `.values` is now discouraged, in favour of `.to_numpy()`. Its use here is in fact entirely unnecessary. – AMC Jan 27 '20 at 00:33
@getaglow _What did you do by adding the .values[:,None]? I notice that the None corresponds to a column._ I'm not the one who wrote the answer, but you can find out more [here](https://stackoverflow.com/q/29241056/11301900). – AMC Jan 27 '20 at 00:36

score 1 · Answer 2 · answered Jan 27 '20 at 00:19

I think this is somewhat more idiomatic, and clearer, than the accepted solution:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 5, 3],
                   'B': [4, 2, 6]})

print(df)

df['avg'] = df.mean(axis=1)

print(df)

df[df[['A', 'B']].ge(df['avg'], axis=0)] = np.NaN

print(df)

Output:

   A  B
0  1  4
1  5  2
2  3  6
   A  B  avg
0  1  4  2.5
1  5  2  3.5
2  3  6  4.5
     A    B  avg
0  1.0  NaN  2.5
1  NaN  2.0  3.5
2  3.0  NaN  4.5

Speaking of the accepted solution, it is no longer recommended to use .values in order to convert a Pandas DataFrame or Series to a NumPy Array. Fortunately, we don't actually need to use it at all here:

df.mask(df > df['avg'][:, np.newaxis])

Boolean Operation on Pandas Dataframe Column Average - this has got to be simple

2 Answers2