Suppose I have the following dataset:
| | code | category | energy | sugars | proteins |
---|---|---|---|---|---|
0 | 01 | B | 936 | NaN | 7.8 |
1 | 02 | NaN | NaN | 15.0 | NaN |
2 | 03 | A | 1569.0 | 23 | 4.1 |
3 | 04 | NaN | 826 | NaN | 3 |
4 | 05 | B | 1345 | 22 | 5.1 |
5 | 06 | A | NaN | 17 | NaN |
6 | 10 | C | 826 | NaN | 3 |
7 | 11 | C | 1345 | 26 | 5.1 |
8 | 101 | B | NaN | 18 | 6.1 |
9 | 102 | B | 636 | NaN | 7.8 |
10 | 103 | NaN | NaN | 15.0 | NaN |
11 | 104 | A | 1569.0 | 23 | 4.1 |
12 | 105 | C | 813 | NaN | 3.5 |
I would like to impute the missing values with `SimpleImputer`, taking the `category` column into account.
Namely, I would like to fill each missing value with the mean computed over the products of the same category.
If a product has no category, I would like to use the mean of the products without a category.
So, to fill in `sugars` for code `01`, I would only consider the `sugars` values of the products with category `B`:
| | code | category | energy | sugars | proteins |
---|---|---|---|---|---|
0 | 01 | B | 936 | NaN | 7.8 |
4 | 05 | B | 1345 | 22 | 5.1 |
8 | 101 | B | NaN | 18 | 6.1 |
9 | 102 | B | 636 | NaN | 7.8 |
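To make the expected value concrete, here is a small sketch that rebuilds only the four category `B` rows above and computes the mean that should end up in the missing `sugars` cells. The variable names `b_rows`, `b_sugar_mean` and `filled` are just illustrative.

```python
import numpy as np
import pandas as pd

# The four category-B rows from the table above (index kept as in the dataset).
b_rows = pd.DataFrame(
    {
        "code": ["01", "05", "101", "102"],
        "category": ["B", "B", "B", "B"],
        "energy": [936, 1345, np.nan, 636],
        "sugars": [np.nan, 22, 18, np.nan],
        "proteins": [7.8, 5.1, 6.1, 7.8],
    },
    index=[0, 4, 8, 9],
)

# Series.mean() skips NaN by default, so only 22 and 18 contribute.
b_sugar_mean = b_rows["sugars"].mean()
print(b_sugar_mean)  # 20.0

# The value expected for the missing sugars of codes 01 and 102.
filled = b_rows["sugars"].fillna(b_sugar_mean)
```

So the missing `sugars` for codes `01` and `102` should both become 20.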
I did something similar with plain pandas, as shown below, but I need to do it with `SimpleImputer`.
To clarify, in the code below the NaN values of rows without a category are filled with the mean of the whole column.

    for col in df.columns:
        if df[col].dtype == "float64":
            # Rows with a category: fill NaN with the mean of that category.
            df.loc[df[col].isna() & df["category"].notna(), col] = df["category"].map(df.groupby("category")[col].mean())
            # Remaining NaN (e.g. rows without a category): fill with the column mean.
            df[col] = df[col].fillna(df[col].mean())
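For reference, here is a minimal sketch of one way the per-category behaviour described above might be obtained with `SimpleImputer`: fit a separate imputer on each category group, treating the rows without a category as a group of their own. This is an assumption about how to combine the two, not a built-in option of `SimpleImputer`; the `"missing"` placeholder label and the names `num_cols`, `group_key` and `imputed` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# The dataset from the question.
df = pd.DataFrame(
    {
        "code": ["01", "02", "03", "04", "05", "06", "10",
                 "11", "101", "102", "103", "104", "105"],
        "category": ["B", np.nan, "A", np.nan, "B", "A", "C",
                     "C", "B", "B", np.nan, "A", "C"],
        "energy": [936, np.nan, 1569, 826, 1345, np.nan, 826,
                   1345, np.nan, 636, np.nan, 1569, 813],
        "sugars": [np.nan, 15, 23, np.nan, 22, 17, np.nan,
                   26, 18, np.nan, 15, 23, np.nan],
        "proteins": [7.8, np.nan, 4.1, 3, 5.1, np.nan, 3,
                     5.1, 6.1, 7.8, np.nan, 4.1, 3.5],
    }
)

num_cols = ["energy", "sugars", "proteins"]

# Hypothetical placeholder so rows without a category form their own group.
group_key = df["category"].fillna("missing")

imputed = df.copy()
for label, idx in df.groupby(group_key).groups.items():
    # A fresh imputer per group, so each mean only sees that group's rows.
    imputer = SimpleImputer(strategy="mean")
    imputed.loc[idx, num_cols] = imputer.fit_transform(df.loc[idx, num_cols])

print(imputed)
```

One caveat: if some group has a numeric column that is entirely NaN, `SimpleImputer` drops that column by default when fitting, so the assignment above would fail with a shape mismatch. That does not happen with this dataset, but it would need handling in general.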