How to transform some columns only with SimpleImputer or equivalent

Question

I am taking my first steps with scikit library and found myself in need of backfilling only some columns in my data frame.

I have read carefully the documentation but I still cannot figure out how to achieve this.

To make this more specific, let's say I have:

A = [[7,2,3],[4,np.nan,6],[10,5,np.nan]]

And that I would like to fill in the second column with the mean but not the third. How can I do this with SimpleImputer (or another helper class)?

An evolution from this, and the natural follow up questions is: how can I fill the second column with the mean and the last column with a constant (only for cells that had no values to begin with, obviously)?

Take a look at [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) — Dan, Aug 13 '19 at 10:34

liakoyras · Accepted Answer · 2021-06-05T18:49:42.377

16

There is no need to use the SimpleImputer.
DataFrame.fillna() can do the work as well

For the second column, use

column.fillna(column.mean(), inplace=True)
For the third column, use

column.fillna(constant, inplace=True)

Of course, you will need to replace column with your DataFrame's column you want to change and constant with your desired constant.

Edit
Since the use of inplace is discouraged and will be deprecated, the syntax should be

column = column.fillna(column.mean())

edited Jun 05 '21 at 18:49

answered Aug 13 '19 at 11:29

liakoyras

1,101
12
27

1

That is really neat and clean. It makes me wonder why bother with SimpleImputter at all. Thank you! – quiet-ranger Aug 13 '19 at 18:32
pandas is really awesome for stuff like this – liakoyras Aug 13 '19 at 18:39
1

keep in mind that `inplace` will be deprecated in Pandas 2.0 so it might be best to explicitly assign. – LazyEval Jun 05 '21 at 14:26
Question asked for use of SimpleImputer or other classes. This answer can not be chained into a pipeline. – gsandhu Feb 21 '23 at 18:45
@gsandhu there is no mention of pipelines or classes in the original question. – liakoyras Feb 22 '23 at 11:52

score 10 · Answer 2 · answered Aug 14 '19 at 01:39

Following Dan's advice, an example of using ColumnTransformer and SimpleImputer to backfill the columns is:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

A = [[7,2,3],[4,np.nan,6],[10,5,np.nan]]

column_trans = ColumnTransformer(
[('imp_col1', SimpleImputer(strategy='mean'), [1]),
 ('imp_col2', SimpleImputer(strategy='constant', fill_value=29), [2])],
remainder='passthrough')

print(column_trans.fit_transform(A)[:, [2,0,1]])
# [[7 2.0 3]
#  [4 3.5 6]
#  [10 5.0 29]]

This approach helps with constructing pipelines which are more suitable for larger applications.

score 2 · Answer 3 · answered Jan 22 '20 at 13:03

This is methode I use, you can replace low_cardinality_cols by cols you want to encode. But this works also justt set value unique to max(df.columns.nunique()).

#check cardinalité des cols a encoder
low_cardinality_cols = [cname for cname in df.columns if df[cname].nunique() < 16 and 
                        df[cname].dtype == "object"]

Why thes columns, it's recommanded, to encode only cols with cardinality near 10.

# Replace NaN, if not you'll stuck
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent') # feel free to use others strategy
df[low_cardinality_cols]  = imp.fit_transform(df[low_cardinality_cols])

# Apply label encoder 
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
for col in low_cardinality_cols:
    df[col] = label_encoder.fit_transform(df[col])
    ```

score 2 · Answer 4 · answered Jun 05 '21 at 14:20

I am assuming you have your data as a pandas dataframe.

In this case, all you need to do to use the SimpleImputer from scikitlearn is to pick the specific column your looking to impute nan's using say using the 'most_frequent' values, convert it to a numpy array and reshape into a column vector.

An example of this is,

## Imputing the missing values, we fill the missing values using the 'most_frequent'
# We are using the california housing dataset in this example
housing = pd.read_csv('housing.csv')
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
#Simple imputer expects a column vector, so converting the pandas Series
housing['total_bedrooms'] = imp.fit_transform(housing['total_bedrooms'].to_numpy().reshape(-1,1))

Similarly, you can pick any column in your dataset convert into a NumPy array, reshape it and use the SimpleImputer

How to transform some columns only with SimpleImputer or equivalent

4 Answers4