
I've got a pandas DataFrame containing real numbers and categories, but there are a few NaN values in it.

How can I replace the NaNs with the mean or median of the category group each one belongs to?

         A         B
0  model 2  0.979728
1  model 1  0.912674
2  model 2  0.540679
3  model 1  2.027325
4  model 2       NaN
5  model 1       NaN
6  model 3 -0.612343
7  model 1  1.033826
8  model 1  1.025011
9  model 2 -0.795876

In this case I would like to replace the two NaNs with the mean or median of their respective groups (model 2 for row 4, model 1 for row 5).

Thank you in advance

PdF

1 Answer


You can use groupby + transform + fillna:

>>> df['B'] = df.B.fillna(df.groupby('A')['B'].transform('mean'))
>>> df
        A         B
0 model 2  0.979728
1 model 1  0.912674
2 model 2  0.540679
3 model 1  2.027325
4 model 2  0.241510
5 model 1  1.249709
6 model 3 -0.612343
7 model 1  1.033826
8 model 1  1.025011
9 model 2 -0.795876
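
If you would rather fill with the median, the same pattern should work by swapping the aggregation name passed to transform. Below is a minimal, self-contained sketch that rebuilds the frame from the question; the names group_median and filled are my own, not anything pandas requires.

```python
import numpy as np
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    "A": ["model 2", "model 1", "model 2", "model 1", "model 2",
          "model 1", "model 3", "model 1", "model 1", "model 2"],
    "B": [0.979728, 0.912674, 0.540679, 2.027325, np.nan,
          np.nan, -0.612343, 1.033826, 1.025011, -0.795876],
})

# transform returns one value per row (the median of that row's group),
# so the result lines up with df's index and can be passed to fillna.
group_median = df.groupby("A")["B"].transform("median")

# Only the NaN rows are replaced; existing values are left untouched.
filled = df["B"].fillna(group_median)
print(filled)
```

Nothing is typed in by hand here: the per-group statistic is computed from the data, so this should scale to a large frame without extra effort.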
Brad Solomon
  • Very good, thanks, but this way I have to impute the mean manually, and my dataset is very big, so the effort would be very high. Is it possible to use a groupby like this? group_data_median = df.groupby(['A'])['B'].median() – PdF Nov 16 '18 at 13:51
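
A groupby result like the one in the comment can also be used directly, as far as I can tell: it is a Series indexed by category, so mapping column A through it gives a per-row fill value. A small sketch on made-up toy data (the frame and the name filled_by_map are hypothetical, chosen only to keep the arithmetic easy to check):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the frame from the question
df = pd.DataFrame({
    "A": ["model 1", "model 1", "model 2", "model 2", "model 1"],
    "B": [1.0, 3.0, 10.0, np.nan, np.nan],
})

# One median per category, indexed by the values of column A
group_data_median = df.groupby(["A"])["B"].median()

# Look up each row's category in that Series, then fill only the NaNs
filled_by_map = df["B"].fillna(df["A"].map(group_data_median))
print(filled_by_map)
```

No value is entered manually in either approach; transform simply does the lookup step for you.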