Fill NaN with mean of a group for each column

Question

I know that the fillna() method can be used to fill NaN values in a whole DataFrame.

df.fillna(df.mean()) # fill with mean of column.

How can I limit the mean calculation to the group (and the column) where the NaN is?

Example:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a': pd.Series([1,1,1,2,2,2]),
    'b': pd.Series([1,2,np.nan,1,np.nan,4])
})

print(df)

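Input

   a    b
0  1  1.0
1  1  2.0
2  1  NaN
3  2  1.0
4  2  NaN
5  2  4.0

Output (after groupby('a') & replace NaN by mean of group)

   a    b
0  1  1.0
1  1  2.0
2  1  1.5
3  2  1.0
4  2  2.5
5  2  4.0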

This output is just an example, but say you have many NaNs in several other columns b, c, d, etc. — Ghilas BELHADJ, Nov 30 '15 at 16:48
In the future it would be useful to post your complete requirements as it matters and it affects the answers — EdChum, Nov 30 '15 at 16:57

EdChum · Accepted Answer · 2015-11-30T16:54:12.470

IIUC, you can call fillna with the result of grouping by 'a' and calling transform('mean') on 'b':

In [44]:
df['b'] = df['b'].fillna(df.groupby('a')['b'].transform('mean'))
df

Out[44]:
   a    b
0  1  1.0
1  1  2.0
2  1  1.5
3  2  1.0
4  2  2.5
5  2  4.0
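
To see why this works, it may help to look at what transform returns on its own. The following is a minimal sketch, assuming the same two-column frame as in the question; the variable names group_mean_per_row and filled_b are only illustrative and are not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 1, 1, 2, 2, 2],
    'b': [1, 2, np.nan, 1, np.nan, 4],
})

# transform('mean') returns one value per original row (not one per group),
# so the result shares df's index and can be passed straight to fillna.
group_mean_per_row = df.groupby('a')['b'].transform('mean')

# fillna only touches the rows where 'b' is NaN; other rows keep their value.
filled_b = df['b'].fillna(group_mean_per_row)

print(group_mean_per_row.tolist())
print(filled_b.tolist())
```

Because the transformed Series is aligned row by row with the original, no merge or manual lookup is needed before filling.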

If you have NaN values in several columns, then I think the following should work:

In [47]:
df.fillna(df.groupby('a').transform('mean'))

Out[47]:
   a    b
0  1  1.0
1  1  2.0
2  1  1.5
3  2  1.0
4  2  2.5
5  2  4.0

EDIT

In [49]:
df = pd.DataFrame({
    'a': pd.Series([1,1,1,2,2,2]),
    'b': pd.Series([1,2,np.nan,1,np.nan,4]),
    'c': pd.Series([1,np.nan,np.nan,1,np.nan,4]),
    'd': pd.Series([np.nan,np.nan,np.nan,1,np.nan,4])
})
df

Out[49]:
   a   b   c   d
0  1   1   1 NaN
1  1   2 NaN NaN
2  1 NaN NaN NaN
3  2   1   1   1
4  2 NaN NaN NaN
5  2   4   4   4

In [50]:
df.fillna(df.groupby('a').transform('mean'))

Out[50]:
   a    b    c    d
0  1  1.0  1.0  NaN
1  1  2.0  1.0  NaN
2  1  1.5  1.0  NaN
3  2  1.0  1.0  1.0
4  2  2.5  2.5  2.5
5  2  4.0  4.0  4.0

The 'd' values for group 1 stay NaN because every 'd' value in that group is NaN, so that group has no mean to fill with.
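
A quick way to confirm which cells are left unfilled is to inspect the per-group means directly. This is only a sketch that rebuilds the four-column frame from the EDIT; the names group_means and filled are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 1, 1, 2, 2, 2],
    'b': [1, 2, np.nan, 1, np.nan, 4],
    'c': [1, np.nan, np.nan, 1, np.nan, 4],
    'd': [np.nan, np.nan, np.nan, 1, np.nan, 4],
})

# One row per group: a group whose values are all NaN has a NaN mean.
group_means = df.groupby('a').mean()

# Same call as in the answer: fill every column with its own group mean.
filled = df.fillna(df.groupby('a').transform('mean'))

print(group_means)
# Only the rows of group 1 are still missing in 'd'.
print(filled['d'].isna().tolist())
```

Columns 'b' and 'c' end up fully filled, while 'd' is filled only for group 2, which is the one group with usable 'd' values.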

The answer in your edit is what I'm looking for. Thank you. — Ghilas BELHADJ, Nov 30 '15 at 16:56

score 0 · Answer 2 · answered Nov 30 '15 at 16:50

We first compute the group means, ignoring the missing values:

group_means = df.groupby('a')['b'].agg(lambda v: np.nanmean(v))

Next, we group by 'a' again and fill each group's NaN values with that group's mean:

df_new = df.groupby('a').apply(lambda t: t.fillna(group_means.loc[t['a'].iloc[0]]))
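
The second step amounts to looking up each row's group mean by its key in 'a'. As a rough sketch of the same two-step idea, the lookup could also be done with Series.map instead of a second groupby. This is a swapped-in variant, not the code from this answer, and row_means is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 1, 1, 2, 2, 2],
    'b': [1, 2, np.nan, 1, np.nan, 4],
})

# Step 1: one mean per group; the pandas mean skips NaN by default.
group_means = df.groupby('a')['b'].mean()

# Step 2: look up each row's group mean through its key in column 'a'.
row_means = df['a'].map(group_means)

# Fill only column 'b', leaving the original frame untouched.
df_new = df.assign(b=df['b'].fillna(row_means))
print(df_new)
```

Restricting the fill to column 'b' avoids writing the mean of 'b' into other columns if the frame has more than two.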
