
Here is my question; I hope someone can help me figure it out.

To explain: there are more than 10 categorical columns in my data set, and each of them has 200-300 categories. I want to convert them into binary values. To do that, I first used LabelEncoder to convert the string categories into numbers. The LabelEncoder code and its output are shown below.

https://i.stack.imgur.com/MIVHV.png

After the LabelEncoder, I used scikit-learn's OneHotEncoder, and it worked. BUT THE PROBLEM IS, I need the column names after one-hot encoding. For example, take a column A with categorical values before encoding, A = [1,2,3,4,..]

After encoding, it should look like this:

A-1, A-2, A-3

Does anyone know how to assign column names (old column name + value name or number) after one-hot encoding? Here is my one-hot encoding and its output:

https://i.stack.imgur.com/kgrNa.png

I need named columns because I trained an ANN, and every time new data comes in I cannot re-encode all the past data again and again. I just want to append the new columns each time. Thanks anyway.

Venkatachalam
Aditya Pratama
    Please, [DO NOT use images of code](https://stackoverflow.com/help/minimal-reproducible-example). *Copy the actual text from your code editor, paste it into the question, then format it as code. This helps others more easily read and test your code*. – sentence May 28 '19 at 11:06

5 Answers


You can get the column names using the .get_feature_names() method.

>>> ohenc.get_feature_names()
>>> x_cat_df.columns = ohenc.get_feature_names()

A detailed example is here.

Update

From version 1.0, use get_feature_names_out instead.

Venkatachalam
    `get_feature_names` is deprecated in scikit-learn 1.2, use `get_feature_names_out` instead – Lfppfs Dec 16 '21 at 18:54
    Thanks. I think it was deprecated in v1.0. [reference](https://scikit-learn.org/stable/whats_new/v1.0.html#version-1-0-0). BTW, 1.2 version is not released yet! – Venkatachalam Dec 18 '21 at 12:12
  • just a little addition to @Venkatachalam's answer, ```ohe_cols = ['GIRO_BRACKET_CODE','ASSET_SIZE_BRACKET_CODE']; ohe_cols_new = [f.replace(f.split('_')[0], ohe_cols[int(f.split('_')[0][1:])]) for f in ohe.get_feature_names()] ``` – Alper Aydın Sep 24 '22 at 11:31

This example could help future readers:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_X = pd.DataFrame({'Sex': ['male', 'female']*3, 'AgeGroup': [0, 15, 30, 45, 60, 75]})
>>>
      Sex  AgeGroup
0    male         0
1  female        15
2    male        30
3  female        45
4    male        60
5  female        75

encoder = OneHotEncoder(sparse=False)

train_X_encoded = pd.DataFrame(encoder.fit_transform(train_X[['Sex']]))

train_X_encoded.columns = encoder.get_feature_names(['Sex'])

train_X.drop(['Sex'], axis=1, inplace=True)

OH_X_train = pd.concat([train_X, train_X_encoded], axis=1)
>>>
   AgeGroup  Sex_female  Sex_male
0         0         0.0       1.0
1        15         1.0       0.0
2        30         0.0       1.0
3        45         1.0       0.0
4        60         0.0       1.0
5        75         1.0       0.0
Gander
Lucas Bensaid

Hey, I had the same problem with a custom estimator that extended the BaseEstimator class from sklearn.base.

I added an attribute called self.feature_names in __init__; then, as the last step of the transform method, I update self.feature_names with the columns of the result.

from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd

class CustomOneHotEncoder(BaseEstimator, TransformerMixin):

    def __init__(self, **kwargs):
        self.feature_names = []

    def fit(self, X, y=None):
        return self

    def transform(self, X):

        result = pd.get_dummies(X)
        self.feature_names = result.columns

        return result

A bit basic, I know, but it does the job I need it to.

If you want to retrieve the column names for the feature importances from your sklearn pipeline, you can get the importances from the classifier step and the column names from the one-hot encoding step.

a = model.best_estimator_.named_steps["clf"].feature_importances_
b = model.best_estimator_.named_steps["ohc"].feature_names

df = pd.DataFrame(a,b)
df.sort_values(by=[0], ascending=False).head(20)
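To make the snippet above self-contained, here is a sketch of the kind of pipeline it assumes; the step names "ohc" and "clf" match the lookup keys, and a plain Pipeline with a RandomForestClassifier stands in for the answer's model.best_estimator_ (which would come from a grid search):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

class CustomOneHotEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, **kwargs):
        self.feature_names = []

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        result = pd.get_dummies(X)
        self.feature_names = result.columns  # remember the generated names
        return result

X = pd.DataFrame({'Sex': ['male', 'female'] * 4})
y = [0, 1] * 4

pipe = Pipeline([('ohc', CustomOneHotEncoder()),
                 ('clf', RandomForestClassifier(n_estimators=10, random_state=0))])
pipe.fit(X, y)

# pair each importance with the column name recorded during transform
importances = pd.DataFrame(pipe.named_steps['clf'].feature_importances_,
                           index=pipe.named_steps['ohc'].feature_names)
print(importances.sort_values(by=0, ascending=False))
```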

J.Bolshaw

There is another easy way with the package category_encoders; its encoders can also be dropped into a pipeline, which is one of the data science best practices.

import pandas as pd
from category_encoders.one_hot import OneHotEncoder

X = pd.DataFrame({'Sex':['male', 'female']*3, 'AgeGroup':[0,15,30,45,60,75]})

ohe = OneHotEncoder(use_cat_names=True)
ohe.fit_transform(X)

Carlos Mougan

Update: building on @Venkatachalam's answer, the method get_feature_names() was deprecated in scikit-learn 1.0, so you will get a warning when running it. Use get_feature_names_out() instead:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

ohenc = OneHotEncoder(sparse=False)
x_cat_df = pd.DataFrame(ohenc.fit_transform(xtrain_lbl))
x_cat_df.columns = ohenc.get_feature_names_out(input_features=xtrain_lbl.columns)

Setting the parameter sparse=False in OneHotEncoder() returns an array instead of a sparse matrix, so you don't need to convert it later. fit_transform() computes the encoding parameters and transforms the training set in one step.

Source: OneHotEncoder documentation

Meng