
I am trying to create an sklearn pipeline with 2 steps:

  1. Standardize the data
  2. Fit the data using KNN

However, my data has both numeric and categorical variables, which I have converted to dummies using pd.get_dummies. I want to standardize the numeric variables but leave the dummies as they are. So far I have been doing it like this:

X = ...              # DataFrame containing both numeric and categorical columns
numeric = [...]      # list of numeric column names
categorical = [...]  # list of categorical (dummy) column names

scaler = StandardScaler()
X_numeric_std = pd.DataFrame(data=scaler.fit_transform(X[numeric]), columns=numeric)
X_std = pd.merge(X_numeric_std, X[categorical], left_index=True, right_index=True)

However, if I were to create a pipeline like:

pipe = sklearn.pipeline.make_pipeline(StandardScaler(), KNeighborsClassifier())

It would standardize all of the columns in my DataFrame. Is there a way to do this while standardizing only the numeric columns?

– TTT
  • Possible duplicate of [Feature preprocessing of both continuous and categorical variables (of integer type) with scikit-learn](https://stackoverflow.com/questions/43554821/feature-preprocessing-of-both-continuous-and-categorical-variables-of-integer-t) – AlCorreia Feb 07 '18 at 21:35

5 Answers


UPD: 2021-05-10

For sklearn >= 0.20 we can use sklearn.compose.ColumnTransformer

Here is a small example:

Imports and data loading:

# Author: Pedro Morales <part.morales@gmail.com>
#
# License: BSD 3 clause

import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

np.random.seed(0)

# Load data from https://www.openml.org/d/40945
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

Pipeline-aware data preprocessing using ColumnTransformer:

numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

Classification:

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))

OLD Answer:

Assuming you have the following DF:

In [163]: df
Out[163]:
     a     b    c    d
0  aaa  1.01  xxx  111
1  bbb  2.02  yyy  222
2  ccc  3.03  zzz  333

In [164]: df.dtypes
Out[164]:
a     object
b    float64
c     object
d      int64
dtype: object

you can find all numeric columns:

In [165]: num_cols = df.columns[df.dtypes.apply(lambda c: np.issubdtype(c, np.number))]

In [166]: num_cols
Out[166]: Index(['b', 'd'], dtype='object')

In [167]: df[num_cols]
Out[167]:
      b    d
0  1.01  111
1  2.02  222
2  3.03  333
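Equivalently (a shorter alternative the session above does not show), pandas can select the numeric columns directly:

num_cols = df.select_dtypes(include=np.number).columns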

Then apply StandardScaler only to those numeric columns:

In [168]: scaler = StandardScaler()

In [169]: df[num_cols] = scaler.fit_transform(df[num_cols])

In [170]: df
Out[170]:
     a         b    c         d
0  aaa -1.224745  xxx -1.224745
1  bbb  0.000000  yyy  0.000000
2  ccc  1.224745  zzz  1.224745

Now you can one-hot encode the categorical (non-numeric) columns.
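For example, with pd.get_dummies (a sketch continuing the session above, where 'a' and 'c' are the remaining object columns):

df_encoded = pd.get_dummies(df, columns=['a', 'c'])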

– MaxU - stand with Ukraine

I would use FeatureUnion. I usually do something like the following, assuming you also dummy-encode your categorical variables within the pipeline instead of beforehand with Pandas:

from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neighbors import KNeighborsClassifier

class Columns(BaseEstimator, TransformerMixin):
    """Simple transformer that selects the given columns from a DataFrame."""
    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None, **fit_params):
        # nothing to learn; selection is purely by column name
        return self

    def transform(self, X):
        return X[self.names]

numeric = [...]      # list of numeric column names
categorical = [...]  # list of categorical column names

pipe = Pipeline([
    ("features", FeatureUnion([
        ('numeric', make_pipeline(Columns(names=numeric), StandardScaler())),
        ('categorical', make_pipeline(Columns(names=categorical), OneHotEncoder(sparse=False)))
    ])),
    ('model', KNeighborsClassifier())
])
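A usage sketch (X_train, y_train, X_test are assumed to be your own splits):

pipe.fit(X_train, y_train)     # runs fit_transform on both FeatureUnion branches, then fits KNN
y_pred = pipe.predict(X_test)  # re-applies the same transforms before predicting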

You could further check out Sklearn Pandas, which is also interesting.

– Marcus V.
  • Hey Marcus, thanks for your post here. So how would you use this "pipe" on training and testing data? pipe.fit(X_train, y_train)? But in that case the encoders' fit_transform step would be left out. But if I use fit_transform, then the model fitting part would be left out. – DanZimmerman Dec 10 '18 at 20:33
  • You can use it as any estimator and first call `pipe.fit(X_train, y_train)`. It will call all the `fit_transform()` methods of the `TransformerMixin` steps and then the `fit()` of the last step, the estimator. If you then use it for predictions, it will also apply all the transforms. This also works automatically within [model selection](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection) classes. – Marcus V. Dec 11 '18 at 08:24
  • Thanks Marcus this is very helpful. – DanZimmerman Dec 11 '18 at 18:19
  • In the 1st part of your pipeline, I want to apply StandardScaler to the numerical columns, do nothing to the rest, and proceed to the next step after the `FeatureUnion`. I tried skipping it, but the next step only got the standard-scaled columns. How do I do this? – Naveen Reddy Marthala Oct 08 '20 at 08:34
  • @NaveenKumar I don't completely get your question. Currently the output of `Columns(names=numeric)` is passed to `StandardScaler()`, then joined with the output of the `OneHotEncoder()` and passed to the `KNeighborsClassifier()` – Marcus V. Oct 12 '20 at 11:52
  • Sorry that I haven't been clear enough for you to understand me. Let's say I have defined a feature-engineering step (a function, method, or class) that takes only 2 columns from my 100-column data set, generates 10 columns, and leaves the rest untouched. The problem with putting that step in the pipeline is that only the columns this step generates will be passed on to the next step, discarding all the remaining columns. – Naveen Reddy Marthala Oct 12 '20 at 13:34
  • The way I usually do this (but want to include in a pipeline) is: I first generate the 10 columns from the 2 original columns, then join the 10 columns onto the original dataset and drop the 2 original columns, leaving me with 108 columns (100 + 10 - 2). How do I accomplish this? One possible example would be count-vectorised columns from text columns. – Naveen Reddy Marthala Oct 12 '20 at 13:34
  • All I mean to ask is: in the pipeline you have defined, every step operates on all the columns. What if a step operates only on some columns, and the remaining columns don't need to be dropped? From what I have read in the documentation on pipelines, each step takes as input the output of the previous step. And if a step is defined to operate only on some columns, it will discard the rest, which is not ideal. There may even be steps that operate on some columns that don't need to be dropped after features have been generated from them. I hope I have been clear. – Naveen Reddy Marthala Oct 12 '20 at 13:48
  • @NaveenKumar Check out the `ColumnTransformer`, e.g., [this](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#use-columntransformer-by-selecting-column-by-names) example. Note the `remainder` parameter of `ColumnTransformer`, which you can use to control what happens to the columns that are not named (drop them or pass them through). – Marcus V. Oct 13 '20 at 15:17
  • Thanks @MarcusV. I will check that out. – Naveen Reddy Marthala Oct 13 '20 at 16:02

Since you have already converted your categorical features into dummies using pd.get_dummies, you don't need to use OneHotEncoder. As a result, your pipeline should be:

from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

# cat_indices / num_indices: positional indices of the dummy and
# numeric columns in the input array
pipeline = Pipeline(steps=[
    ('feature_processing', FeatureUnion(transformer_list=[
            # categorical (dummy) columns pass through unchanged
            ('categorical', FunctionTransformer(lambda data: data[:, cat_indices])),

            # numeric columns are selected, then standardized
            ('numeric', Pipeline(steps=[
                ('select', FunctionTransformer(lambda data: data[:, num_indices])),
                ('scale', StandardScaler())
            ]))
        ])),
    ('clf', knn)
])
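Building cat_indices and num_indices is not shown above; a hedged sketch, assuming X is the DataFrame from the question and numeric/categorical are the column-name lists (FunctionTransformer receives a NumPy array, so positional indices are needed):

import numpy as np

num_indices = np.array([X.columns.get_loc(c) for c in numeric])
cat_indices = np.array([X.columns.get_loc(c) for c in categorical])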
– ebrahimi

Making the answer of MaxU - stand with Ukraine more general, so that it works for any columns:

from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Preprocessing Step
numeric_features = Xtrain.select_dtypes(include=['int64','float64']).columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = Xtrain.select_dtypes(exclude=['int64','float64']).columns
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])


# Training Step
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])

clf.fit(Xtrain, Ytrain)
print("model score: %.3f" % clf.score(Xtest, Ytest))
– YazanGhafir

Another way of doing this would be:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df = pd.DataFrame()
df['col1'] = np.random.randint(1, 20, 10)
df['col2'] = np.random.randn(10)
df['col3'] = list(5*'Y' + 5*'N')

# select the non-object (i.e. numeric) columns and scale only those
numeric_cols = list(df.dtypes[df.dtypes != 'object'].index)
df.loc[:, numeric_cols] = scaler.fit_transform(df.loc[:, numeric_cols])
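To verify the scaling, the numeric columns should now have mean ~0 and unit population standard deviation (StandardScaler divides by the population std, i.e. ddof=0):

print(df[numeric_cols].mean())       # approximately 0 for each scaled column
print(df[numeric_cols].std(ddof=0))  # approximately 1 for each scaled column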
– sushmit