I have a pandas DataFrame with the following data, grouped by id and ordered by seq (note that some amounts are NaN). I need to run a scikit-learn imputer that fills each missing amount with the mean for its id.
from sklearn import impute
import pandas as pd
import numpy as np
rows = [{'id': 1, 'seq': 0, 'amount': 2000},
        {'id': 1, 'seq': 1, 'amount': 4000},
        {'id': 1, 'seq': 2, 'amount': np.nan},
        {'id': 2, 'seq': 0, 'amount': 1000},
        {'id': 2, 'seq': 1, 'amount': 3000},
        {'id': 2, 'seq': 2, 'amount': np.nan}]
pdf = pd.DataFrame(rows)
imputer = impute.SimpleImputer(strategy='mean')
# If I run this it will ignore the id
pdf[['amount_imputed']] = imputer.fit_transform(pdf[['amount']])
The result of amount_imputed should be 3000 for id = 1 and 2000 for id = 2. Instead, the statement above fills both missing amounts with the overall mean of the column, 2500. How can I apply the imputer per id?
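One possible approach, as a minimal sketch rather than a definitive answer: fit a separate SimpleImputer on each id group through groupby(...).transform. The helper name impute_group is hypothetical (it is not part of pandas or scikit-learn), and the sketch assumes every id has at least one non-missing amount, because SimpleImputer by default drops a column that is entirely NaN.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same data as in the question.
pdf = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2],
    'seq': [0, 1, 2, 0, 1, 2],
    'amount': [2000, 4000, np.nan, 1000, 3000, np.nan],
})


def impute_group(amounts: pd.Series) -> pd.Series:
    """Hypothetical helper: mean-impute one group's amounts.

    Assumes the group contains at least one non-NaN value.
    """
    imputer = SimpleImputer(strategy='mean')
    # SimpleImputer expects 2D input, so wrap the Series in a one-column frame.
    filled = imputer.fit_transform(amounts.to_frame())
    # Flatten back to 1D and keep the original index so rows stay aligned.
    return pd.Series(filled.ravel(), index=amounts.index)


# transform calls impute_group once per id and reassembles the results
# in the original row order.
pdf['amount_imputed'] = pdf.groupby('id')['amount'].transform(impute_group)
print(pdf)
```

If scikit-learn is not strictly required, pdf['amount'].fillna(pdf.groupby('id')['amount'].transform('mean')) should give the same values with pandas alone.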