28

I'm confused because applying OneHotEncoder first and then StandardScaler is going to be a problem: the scaler will also scale the columns previously transformed by OneHotEncoder. Is there a way to perform encoding and scaling at the same time and then concatenate the results together?

James Wong
    OneHotEncoder has a parameter `categorical_features` to specify the columns to encode. And you can use the [FeatureUnion](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html) to do both the things separately and then merge them together. – Vivek Kumar May 05 '17 at 07:25
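For context, a minimal sketch of the `categorical_features` route mentioned in the comment above. The parameter only exists in older scikit-learn releases (it was deprecated in 0.20 in favour of ColumnTransformer), and the column index 3 assumes the `rank` column of the binary.csv dataset used in the answers below:

from sklearn.preprocessing import OneHotEncoder

# scikit-learn < 0.20 only: one-hot encode column 3 ('rank') and pass
# the remaining columns through untouched.
ohe = OneHotEncoder(categorical_features=[3], sparse=False)
transformed = ohe.fit_transform(dataset.values)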

4 Answers

36

Sure thing. Just scale and one-hot-encode the separate columns as needed:

# Import libraries and download example data
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler, OneHotEncoder

dataset = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
print(dataset.head(5))

# Define which columns should be encoded vs scaled
columns_to_encode = ['rank']
columns_to_scale  = ['gre', 'gpa']

# Instantiate encoder/scaler
scaler = StandardScaler()
ohe    = OneHotEncoder(sparse=False)

# Scale and Encode Separate Columns
scaled_columns  = scaler.fit_transform(dataset[columns_to_scale])
encoded_columns = ohe.fit_transform(dataset[columns_to_encode])

# Concatenate (Column-Bind) Processed Columns Back Together
processed_data = np.concatenate([scaled_columns, encoded_columns], axis=1)
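
If you want the result back as a labeled DataFrame rather than a bare NumPy array, here is a minimal sketch, assuming scikit-learn >= 0.20 where `OneHotEncoder` gained `get_feature_names` (newer releases spell it `get_feature_names_out`):

# Rebuild readable column names: scaled columns first, then the one-hot
# columns, matching the np.concatenate order above.
encoded_names = list(ohe.get_feature_names(columns_to_encode))
processed_df = pd.DataFrame(processed_data,
                            columns=columns_to_scale + encoded_names)
print(processed_df.head())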
KT12
Max Power
10

Since version 0.20, scikit-learn provides sklearn.compose.ColumnTransformer, as shown in its "Column Transformer with Mixed Types" example. You can scale the numeric features and one-hot encode the categorical ones in a single step. Below is the official example (you can find the code here):

# Author: Pedro Morales <part.morales@gmail.com>
#
# License: BSD 3 clause

from __future__ import print_function

import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

np.random.seed(0)

# Read data from Titanic dataset.
titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
data = pd.read_csv(titanic_url)

# We will train our classifier with the following features:
# Numeric Features:
# - age: float.
# - fare: float.
# Categorical Features:
# - embarked: categories encoded as strings {'C', 'S', 'Q'}.
# - sex: categories encoded as strings {'female', 'male'}.
# - pclass: ordinal integers {1, 2, 3}.

# We create the preprocessing pipelines for both numeric and categorical data.
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

X = data.drop('survived', axis=1)
y = data['survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))

Caution: this method is EXPERIMENTAL; some behaviors may change between releases without deprecation.

NiYanchun
7

There are presently several methods to achieve the outcome required by the OP. 3 of them are

  1. np.concatenate() - see this answer to the OP's question, already posted

  2. scikit-learn's ColumnTransformer

  3. scikit-learn's FeatureUnion

Using the example posted by @Max Power here, below is a minimal working snippet that does what the OP is looking for and brings the transformed columns together into a single Pandas dataframe. The output of all 3 approaches is shown below.

The common code for all 3 methods is

import numpy as np
import pandas as pd

# Import libraries and download example data
from sklearn.preprocessing import StandardScaler, OneHotEncoder

dataset = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

# Define which columns should be encoded vs scaled
columns_to_encode = ['rank']
columns_to_scale  = ['gre', 'gpa']

# Instantiate encoder/scaler
scaler = StandardScaler()
ohe    = OneHotEncoder(sparse=False)

Method 1. See the code in @Max Power's answer above. To show the output, you can use

print(pd.DataFrame(processed_data).head())

Output of Method 1.

          0         1    2    3    4    5
0 -1.800263  0.579072  0.0  0.0  1.0  0.0
1  0.626668  0.736929  0.0  0.0  1.0  0.0
2  1.840134  1.605143  1.0  0.0  0.0  0.0
3  0.453316 -0.525927  0.0  0.0  0.0  1.0
4 -0.586797 -1.209974  0.0  0.0  0.0  1.0

Method 2.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline


p = Pipeline(
    [("coltransformer", ColumnTransformer(
        transformers=[
            ("assessments", Pipeline([("scale", scaler)]), columns_to_scale),
            ("ranks", Pipeline([("encode", ohe)]), columns_to_encode),
        ]),
    )]
)

print(pd.DataFrame(p.fit_transform(dataset)).head())

Output of Method 2.

          0         1    2    3    4    5
0 -1.800263  0.579072  0.0  0.0  1.0  0.0
1  0.626668  0.736929  0.0  0.0  1.0  0.0
2  1.840134  1.605143  1.0  0.0  0.0  0.0
3  0.453316 -0.525927  0.0  0.0  0.0  1.0
4 -0.586797 -1.209974  0.0  0.0  0.0  1.0

Method 3.

from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion


class ItemSelector(BaseEstimator, TransformerMixin):
    """Select a subset of DataFrame columns for downstream transformers."""

    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, df):
        return df[self.key]

p = Pipeline([("union", FeatureUnion(
    transformer_list=[
        ("assessments", Pipeline([
            ("selector", ItemSelector(key=columns_to_scale)),
            ("scale", scaler)
            ]),
        ),
        ("ranks", Pipeline([
            ("selector", ItemSelector(key=columns_to_encode)),
            ("encode", ohe)
            ]),
        ),
    ]))
])

print(pd.DataFrame(p.fit_transform(dataset)).head())

Output of Method 3.

          0         1    2    3    4    5
0 -1.800263  0.579072  0.0  0.0  1.0  0.0
1  0.626668  0.736929  0.0  0.0  1.0  0.0
2  1.840134  1.605143  1.0  0.0  0.0  0.0
3  0.453316 -0.525927  0.0  0.0  0.0  1.0
4 -0.586797 -1.209974  0.0  0.0  0.0  1.0
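
All 3 outputs are identical, which can be checked with a quick sketch (hypothetical names: `p2` and `p3` stand for the Method 2 and Method 3 pipelines, both of which are named `p` in the snippets above):

# Compare Method 2 and Method 3 against Method 1's processed_data.
assert np.allclose(processed_data, p2.fit_transform(dataset))
assert np.allclose(processed_data, p3.fit_transform(dataset))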

Explanation

  1. Method 1. is already explained.

  2. Methods 2. and 3. accept the full dataset but only perform specific actions on subsets of the data. The modified/processed subsets are brought together (combined) into the final output.

Details

pandas==0.23.4
numpy==1.15.2
scikit-learn==0.20.0

Additional Notes

The 3 methods shown here are probably not the only possibilities; I am sure there are other ways to do this.

SOURCE USED

Updated link to binary.csv dataset: https://stats.idre.ucla.edu/stat/data/binary.csv

edesz
0

I can't get your point: OneHotEncoder is used for nominal data, and StandardScaler is used for numeric data, so you shouldn't use them together on your data.

Sraw
  • Please tell me how to employ `OneHotEncoder` on nominal data (especially of **string** type). I need this feature as badly as anyone else. – James Wong May 05 '17 at 07:59
  • You can use first a LabelEncoder then a OneHotEncoder, as shown in the snippet below. – Thierry Herrmann May 21 '17 at 15:57
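
The snippet from that comment, laid out as a runnable script (note: from scikit-learn 0.20 on, OneHotEncoder accepts strings directly, so the LabelEncoder step is only needed on older versions):

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

winds = np.array([['SE'], ['NW'], ['NW'], ['NE'], ['SE']])

# Map the strings to integer codes, then one-hot encode the codes.
int_encoded = LabelEncoder().fit_transform(winds[:, 0]).reshape((len(winds), -1))
one_hot_encoded = OneHotEncoder(sparse=False).fit_transform(int_encoded)

# one_hot_encoded:
# array([[0., 0., 1.],
#        [0., 1., 0.],
#        [0., 1., 0.],
#        [1., 0., 0.],
#        [0., 0., 1.]])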