Pandas sklearn one-hot encoding dataframe or numpy?

Question

How can I one-hot encode a pandas DataFrame with sklearn (returning either a DataFrame or a numpy array) when some columns do not require encoding?

mydf = pd.DataFrame({'Target':[0,1,0,0,1, 1,1],
                   'GroupFoo':[1,1,2,2,3,1,2],
                    'GroupBar':[2,1,1,0,3,1,2],
                    'GroupBar2':[2,1,1,0,3,1,2],
                    'SomeOtherShouldBeUnaffected':[2,1,1,0,3,1,2]})
columnsToEncode = ['GroupFoo', 'GroupBar']

This is an already label-encoded data frame, and I would like to encode only the columns listed in columnsToEncode.

My problem is that I am unsure whether a pd.DataFrame or the numpy array representation is better, and how to re-merge the encoded part with the remaining columns.

My attempts so far:

myEncoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
myEncoder.fit(X_train)
df = pd.concat([
         df[~columnsToEncode], # select all other / numeric
        # select category to one-hot encode
         pd.Dataframe(encoder.transform(X_train[columnsToEncode]))#.toarray() # not sure what this is for
        ], axis=1).reindex_axis(X_train.columns, axis=1)

Note: I am aware of pandas get_dummies (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html), but that does not play well with a train / test split, where I require such an encoding per fold.

It is not entirely clear to me why one-hot encoding up front is a problem when using a train/test split (both sets probably need this encoding, so just do it before splitting). If it's really needed, it is probably doable with scikit-learn's pipelines (preprocessing is called automatically before the data is passed to the classifier/regressor). Also: you can always use df.as_matrix() to extract a numpy array. – sascha, Oct 07 '16 at 20:27
I will have to try feature union tomorrow. The point is I need to perform a certain preprocessing step per fold. – Georg Heiler, Oct 07 '16 at 20:33
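A minimal sketch of the pipeline idea raised in the comments, assuming a recent scikit-learn that provides ColumnTransformer. The step names ('onehot', 'preprocess', 'clf') and the choice of LogisticRegression are illustrative assumptions, not something from the question. Because the encoder sits inside the pipeline, it is refit on the training part of every fold:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

mydf = pd.DataFrame({'Target': [0, 1, 0, 0, 1, 1, 1],
                     'GroupFoo': [1, 1, 2, 2, 3, 1, 2],
                     'GroupBar': [2, 1, 1, 0, 3, 1, 2],
                     'GroupBar2': [2, 1, 1, 0, 3, 1, 2],
                     'SomeOtherShouldBeUnaffected': [2, 1, 1, 0, 3, 1, 2]})
columnsToEncode = ['GroupFoo', 'GroupBar']

X = mydf.drop(columns='Target')
y = mydf['Target']

# One-hot encode only the listed columns; pass every other column through unchanged.
preprocess = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(handle_unknown='ignore'), columnsToEncode)],
    remainder='passthrough')

# The classifier is a placeholder (assumption); any estimator can take its place.
model = Pipeline([('preprocess', preprocess),
                  ('clf', LogisticRegression())])

# cross_val_score clones and refits the whole pipeline, encoder included, on each fold.
scores = cross_val_score(model, X, y, cv=3)
print(scores)
```

With handle_unknown='ignore', a category that only shows up in the held-out part of a fold is encoded as all zeros instead of raising an error.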

Georg Heiler · Accepted Answer · 2016-10-08T09:38:10.330

6

This library provides several categorical encoders which make sklearn / numpy play nicely with pandas https://github.com/wdm0006/categorical_encoding

However, they do not yet support "handle unknown category"

For now I will use:

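from sklearn.preprocessing import OneHotEncoder

# scikit-learn >= 1.2 names this argument sparse_output; older versions use sparse=False
myEncoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
myEncoder.fit(df[columnsToEncode])

# Pass index=df.index so the encoded block lines up with the remaining columns
pd.concat([df.drop(columnsToEncode, axis=1),
           pd.DataFrame(myEncoder.transform(df[columnsToEncode]), index=df.index)], axis=1)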

I use this because it handles categories that were not seen during fitting. For now, I will stick with half-pandas, half-numpy because of the nice pandas labels for the numeric columns.

edited Oct 08 '16 at 09:38

answered Oct 08 '16 at 08:21


While doing myEncoder.fit(df["sales"]), I am getting error as ValueError: Expected 2D array, got 1D array instead: array=['ab' 'vg' 'ab' 'iu' 'ab' 'vg' 'iu']. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample. – Shashank Feb 19 '22 at 16:56
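The error in the comment above most likely comes from passing a 1D Series: OneHotEncoder expects 2D input. A small sketch, reusing the values from the error message, that selects the column with double brackets so a one-column DataFrame is passed instead:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'sales': ['ab', 'vg', 'ab', 'iu', 'ab', 'vg', 'iu']})

encoder = OneHotEncoder(handle_unknown='ignore')

# df['sales'] is a 1D Series and triggers the ValueError;
# df[['sales']] is a one-column DataFrame, i.e. 2D with shape (7, 1).
encoder.fit(df[['sales']])
encoded = encoder.transform(df[['sales']]).toarray()
print(encoder.categories_)
print(encoded.shape)
```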

Matheus Schaly · Answer 2 · 2020-06-17T11:49:45.323

For one-hot encoding I recommend using ColumnTransformer and OneHotEncoder instead of get_dummies. That's because a fitted OneHotEncoder can be reused to encode unseen samples with the same mapping that was learned from your training data.

The following code encodes all the columns provided in the columns_to_encode variable:

import pandas as pd
import numpy as np

df = pd.DataFrame({'cat_1': ['A1', 'B1', 'C1'], 'num_1': [100, 200, 300], 
                   'cat_2': ['A2', 'B2', 'C2'], 'cat_3': ['A3', 'B3', 'C3'],
                   'label': [1, 0, 0]})

X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
columns_to_encode = [0, 2, 3] # Change here
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), columns_to_encode)], remainder='passthrough')
X = np.array(ct.fit_transform(X))

X:

array([[1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 100],
       [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 200],
       [0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 300]], dtype=object)
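To illustrate the point about unseen samples, here is a small sketch that fits the transformer once and then applies it to new rows. The frames train and new are hypothetical, columns are selected by name rather than by position, handle_unknown='ignore' is an added assumption so that the unseen value 'Z9' does not raise, and sparse_threshold=0 just forces a dense result:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'cat_1': ['A1', 'B1', 'C1'], 'num_1': [100, 200, 300]})
new = pd.DataFrame({'cat_1': ['B1', 'Z9'], 'num_1': [400, 500]})  # 'Z9' is unseen

ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(handle_unknown='ignore'), ['cat_1'])],
    remainder='passthrough',
    sparse_threshold=0)  # always return a dense array

ct.fit(train)               # learn the category mapping from the training data
out = ct.transform(new)     # reuse the same mapping on new samples
print(out)
```

The known value 'B1' lands in the same column it had during training, and the unseen 'Z9' becomes a row of zeros in the encoded part.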

To avoid multicollinearity due to the dummy variable trap, I would also suggest dropping one of the dummy columns produced for each column that you encoded. The following code encodes all the columns listed in the columns_to_encode variable AND removes the last dummy column of each encoded column:

import pandas as pd
import numpy as np

def sum_prev (l_in):
    l_out = []
    l_out.append(l_in[0])
    for i in range(len(l_in)-1):
        l_out.append(l_out[i] + l_in[i+1])
    return [e - 1 for e in l_out]

df = pd.DataFrame({'cat_1': ['A1', 'B1', 'C1'], 'num_1': [100, 200, 300], 
                   'cat_2': ['A2', 'B2', 'C2'], 'cat_3': ['A3', 'B3', 'C3'],
                   'label': [1, 0, 0]})

X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
columns_to_encode = [0, 2, 3] # Change here
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), columns_to_encode)], remainder='passthrough')

columns_to_encode = [df.iloc[:, del_idx].nunique() for del_idx in columns_to_encode]
columns_to_encode = sum_prev(columns_to_encode)
X = np.array(ct.fit_transform(X))
X = np.delete(X, columns_to_encode, 1)

X:

array([[1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 100],
       [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 200],
       [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 300]], dtype=object)
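As a possible alternative to deleting columns by hand, OneHotEncoder has a drop parameter. This is a sketch, not the method used above: drop='first' removes the first category of each feature rather than the last, so the resulting columns differ from the output shown, and sparse_threshold=0 is only there to get a dense array:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'cat_1': ['A1', 'B1', 'C1'], 'num_1': [100, 200, 300],
                   'cat_2': ['A2', 'B2', 'C2'], 'cat_3': ['A3', 'B3', 'C3'],
                   'label': [1, 0, 0]})
X = df.drop(columns='label')

# drop='first' omits one dummy column per encoded feature (the first category).
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(drop='first'), ['cat_1', 'cat_2', 'cat_3'])],
    remainder='passthrough',
    sparse_threshold=0)

X_encoded = ct.fit_transform(X)
print(X_encoded)
```

Each of the three encoded features now contributes two columns instead of three, plus the untouched num_1 column at the end.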

score 0 · Answer 3 · answered Dec 06 '17 at 05:45

I believe that this update to the initial answer is even better for performing dummy coding:

import logging

import pandas as pd
from sklearn.base import TransformerMixin

log = logging.getLogger(__name__)


class CategoricalDummyCoder(TransformerMixin):
    """Identifies categorical columns by dtype of object and dummy codes them. Optionally
    (only_categoricals=True) a pandas.DataFrame is returned in which the categorical columns
    are replaced by their integer category codes instead of being binarized, which allows
    coding strategies other than dummy coding."""

    def __init__(self, only_categoricals=False):
        self.categorical_variables = []
        self.categories_per_column = {}
        self.only_categoricals = only_categoricals

    def fit(self, X, y=None):
        self.categorical_variables = list(X.select_dtypes(include=['object']).columns)
        log.debug(f'identified the following categorical variables: {self.categorical_variables}')

        for col in self.categorical_variables:
            self.categories_per_column[col] = X[col].astype('category').cat.categories
        log.debug('fitted categories')
        return self

    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's DataFrame
        for col in self.categorical_variables:
            log.debug(f'transforming cat col: {col}')
            X[col] = pd.Categorical(X[col], categories=self.categories_per_column[col])
            if self.only_categoricals:
                X[col] = X[col].cat.codes
        if not self.only_categoricals:
            return pd.get_dummies(X, sparse=True)
        else:
            return X
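The core trick in the class above is to pin the category list learned at fit time, so that get_dummies produces the same columns for any later frame. A stripped-down sketch of just that step, with made-up train and test frames and without the transformer wrapper:

```python
import pandas as pd

train = pd.DataFrame({'color': ['red', 'green', 'blue'], 'weight': [1, 2, 3]})
test = pd.DataFrame({'color': ['green', 'purple'], 'weight': [4, 5]})  # 'purple' is unseen

# "Fit": remember the categories observed in the training data.
categories = train['color'].astype('category').cat.categories

# "Transform": cast with the fixed category list; unseen values become NaN.
test = test.copy()
test['color'] = pd.Categorical(test['color'], categories=categories)

# get_dummies emits one column per known category, even if it is absent in this frame.
dummies = pd.get_dummies(test)
print(dummies)
```

The unseen value 'purple' ends up as a row with no dummy column set, similar to what handle_unknown='ignore' does in OneHotEncoder.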
