Scikit learn imputer in a pandas dataframe group by id

Question

I have a pandas DataFrame with the following data, grouped by id and ordered by seq (note that some amounts are NaN). I need to run a scikit-learn imputer to fill the missing amounts with the mean for each id.

from sklearn import impute
import pandas as pd
import numpy as np

rows  = [{'id': 1, 'seq': 0, 'amount': 2000 },
          {'id': 1, 'seq': 1, 'amount': 4000 },
          {'id': 1, 'seq': 2, 'amount': np.nan },
          {'id': 2, 'seq': 0, 'amount': 1000 },
          {'id': 2, 'seq': 1, 'amount': 3000 },
          {'id': 2, 'seq': 2, 'amount': np.nan }]

pdf = pd.DataFrame(rows)

imputer = impute.SimpleImputer(strategy='mean')

# If I run this it will ignore the id
pdf[['amount_imputed']] = imputer.fit_transform(pdf[['amount']])

The imputed value in amount_imputed should be 3000 for id = 1 and 2000 for id = 2. Instead, the statement above fills both missing amounts with the overall mean, 2500. How can I group the imputer by id?

Does this answer your question? [scikit-learn impute mean of feature within groups of nominal value in another feature](https://stackoverflow.com/questions/42724040/scikit-learn-impute-mean-of-feature-within-groups-of-nominal-value-in-another-fe). Also https://stackoverflow.com/q/67515224/10495893 — Ben Reiniger, Aug 19 '22 at 14:30
Not really, I'm looking for a groupby statement. Ynjxsjmh answer is almost there, I just need it to return a DataFrame instead of a Series, I don't know how to do that. — ps0604, Aug 19 '22 at 17:10

score 2 · Accepted Answer · answered Aug 19 '22 at 04:12

Let's try

pdf['amount_imputed'] = (pdf.groupby('id', group_keys=False)['amount']
                         .transform(lambda col:
                             imputer.fit_transform(col.to_frame()).flatten(),
                         ))
# or
pdf['amount_imputed'] = (pdf.groupby('id', group_keys=False)
                         .apply(lambda g: pd.Series(
                             imputer.fit_transform(g[['amount']]).flatten(),
                             index=g.index
                         )))

print(pdf)

   id  seq  amount  amount_imputed
0   1    0  2000.0          2000.0
1   1    1  4000.0          4000.0
2   1    2     NaN          3000.0
3   2    0  1000.0          1000.0
4   2    1  3000.0          3000.0
5   2    2     NaN          2000.0
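
On the comment asking for a DataFrame rather than a Series: here is a minimal sketch of one way to get that. The helper name impute_by_group is hypothetical, not part of pandas or scikit-learn. It fits a fresh SimpleImputer on each group and reassembles the pieces into a DataFrame aligned with the original index. It assumes the index is unique and that every group has at least one non-NaN value per column; otherwise SimpleImputer drops the all-NaN column by default and the shapes no longer match.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def impute_by_group(df, group_col, value_cols, strategy="mean"):
    """Hypothetical helper: impute value_cols separately within each group.

    Returns a DataFrame with the same index as df and columns value_cols.
    """
    parts = []
    for _, group in df.groupby(group_col):
        # Fit a new imputer per group so statistics never leak across ids
        imputer = SimpleImputer(strategy=strategy)
        filled = imputer.fit_transform(group[value_cols])
        parts.append(pd.DataFrame(filled, index=group.index, columns=value_cols))
    # Put the rows back in the original order (assumes a unique index)
    return pd.concat(parts).reindex(df.index)


pdf = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "seq": [0, 1, 2, 0, 1, 2],
    "amount": [2000, 4000, np.nan, 1000, 3000, np.nan],
})

result = impute_by_group(pdf, "id", ["amount"])
pdf["amount_imputed"] = result["amount"]
```

Because value_cols is a list, the same call can impute several columns at once, which the Series-based transform version does not do directly.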

Kovarthanan Kesavan · Answer 2 · 2022-08-19T04:33:54.697

1

As far as I know, SimpleImputer has no built-in option for group-based imputation, but there are alternatives in plain pandas.

You can try this:

# Replace amount NaN with mean amount of same id
pdf['amount_imputed'] = pdf.groupby('id').amount.transform(lambda x: x.fillna(x.mean()))
pdf.amount_imputed

Output:

0    2000.0
1    4000.0
2    3000.0
3    1000.0
4    3000.0
5    2000.0
Name: amount_imputed, dtype: float64
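
The same per-id mean fill can also be written without a lambda. This is a sketch of that variant, assuming the example data from the question: the group means are broadcast back to the original rows with transform("mean"), then passed to fillna, which only touches the NaN entries.

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "seq": [0, 1, 2, 0, 1, 2],
    "amount": [2000, 4000, np.nan, 1000, 3000, np.nan],
})

# Mean of each id, repeated on every row of that id (NaN is skipped by mean)
group_means = pdf.groupby("id")["amount"].transform("mean")

# Only the missing entries are replaced; existing amounts stay untouched
pdf["amount_imputed"] = pdf["amount"].fillna(group_means)
```

On large frames this tends to be faster than the lambda version, since the mean is computed by a built-in groupby aggregation rather than a Python function called once per group.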

edited Aug 19 '22 at 04:33

answered Aug 19 '22 at 04:27
