Python imputing values using median basis specific column value selection

Question

I want to impute some blank values in my dataframe with the median. The dataframe looks like this:

ID Salary Position
1  10     VP
2         VP
3  5      VP
4  15     AVP
5  20     AVP
6         AVP

The blank salaries have to be replaced by the median for that position. For example, the blank salary for ID = 2 (position VP) should be imputed with the median of position VP, which is 5, and the blank for AVP should be imputed in the same way.

I have used the following code, but it takes the median of the whole column rather than the median for each position:

impute_median=df['Salary'].median()
df['Salary']=df['Salary'].fillna(impute_median)

The output should look like this :

   ID Salary Position
   1      10     VP
   2      5      VP
   3      5      VP
   4      15     AVP
   5      20     AVP
   6      15     AVP

@ansev: you are right. I used the number 5 just for illustration, treating the blank as 0, so the median of 0, 5, 10 is 5. – Django0602, Feb 06 '20 at 08:17
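
The arithmetic in this comment can be checked with a short sketch. The Series names `vp` and `avp` are only illustrative, and counting a blank as 0 is the asker's interpretation, not what `median()` does with missing values by default.

```python
import pandas as pd

# Salaries per position from the question, with each blank counted as 0
# (the interpretation described in the comment above).
vp = pd.Series([10, 0, 5])
avp = pd.Series([15, 20, 0])

print(vp.median())   # median of 0, 5, 10
print(avp.median())  # median of 0, 15, 20
```

With the blanks ignored instead, the medians are those of (5, 10) and (15, 20), which is why the accepted answer reports 7.5 and 17.5.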

ansev · Accepted Answer · 2020-02-05T17:00:05.547

To fill with the median of each position, use:

df['Salary'] = df['Salary'].fillna(df.groupby('Position').Salary.transform('median'))
print(df)
   ID  Salary Position
0   1    10.0       VP
1   2     7.5       VP
2   3     5.0       VP
3   4    15.0      AVP
4   5    20.0      AVP
5   6    17.5      AVP
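
For a self-contained check, the sketch below rebuilds the question's sample frame and applies the same `fillna` plus `transform('median')` step. It assumes the blanks were loaded as NaN; the intermediate name `group_median` is introduced here only for readability.

```python
import numpy as np
import pandas as pd

# Sample data from the question; blanks are assumed to be NaN after loading.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Salary": [10, np.nan, 5, 15, 20, np.nan],
    "Position": ["VP", "VP", "VP", "AVP", "AVP", "AVP"],
})

# Median per Position, broadcast back to every row (NaN is skipped).
group_median = df.groupby("Position")["Salary"].transform("median")

# Only missing salaries are replaced; existing values are left alone.
df["Salary"] = df["Salary"].fillna(group_median)
print(df)
```

Because `transform` returns a Series aligned to the original index, `fillna` can match each missing row with its own group's median directly.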

If you want to fill in with the existing value closest to the median from below (less than or equal to it):

df['Salary'] = df['Salary'].fillna(df.Salary.sub(df.groupby('Position')
                                    .Salary
                                    .transform('median'))
                           .where(lambda x: x.le(0))
                           .groupby(df['Position'])
                           .transform('idxmax')
                           .map(df['Salary']))
print(df)
   ID  Salary Position
0   1    10.0       VP
1   2     5.0       VP
2   3     5.0       VP
3   4    15.0      AVP
4   5    20.0      AVP
5   6    15.0      AVP
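
A shorter route to the same numbers may be a "lower" median per group, that is, the 0.5 quantile with `interpolation='lower'`, which picks an existing value instead of averaging the two middle ones. This is a sketch of a swapped-in technique, not the code above, and it again assumes the blanks are NaN in a frame rebuilt from the question.

```python
import numpy as np
import pandas as pd

# Sample data from the question; blanks are assumed to be NaN after loading.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Salary": [10, np.nan, 5, 15, 20, np.nan],
    "Position": ["VP", "VP", "VP", "AVP", "AVP", "AVP"],
})

# Per-group 0.5 quantile, rounded down to an existing observation.
# Series.quantile skips NaN, so only the known salaries are considered.
lower_median = df.groupby("Position")["Salary"].transform(
    lambda s: s.quantile(0.5, interpolation="lower")
)

df["Salary"] = df["Salary"].fillna(lower_median)
print(df)
```

On this sample the filled values match the output requested in the question (5 for VP, 15 for AVP).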

score 1 · Answer 2 · answered Feb 05 '20 at 16:40

Try this:

df['Salary'] = df.groupby('Position', group_keys=False)['Salary'].apply(lambda x: x.fillna(x.median()))

Essentially, we group the salaries by position and then fill the missing values in each group with that group's median.

Edeki Okoh

https://stackoverflow.com/questions/54432583/when-should-i-ever-want-to-use-pandas-apply-in-my-code – ansev Feb 05 '20 at 17:26
`apply` might be justified only when a solution, like mine, needs more than one `groupby` statement. For the solution you propose, it is much faster to use: `df['Salary'] = df['Salary'].fillna(df.groupby('Position').Salary.transform('median'))` – ansev Feb 05 '20 at 17:31
