OneHot vectors with feature names

Question

Looking at the documentation of the OneHotEncoder there doesn't seem to be a way to include the feature names as a prefix of the OneHot vectors. Does anyone know of a way around this? Am I missing something?

Sample dataframe:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'a': ['c1', 'c1', 'c2', 'c1', 'c3'], 'b': ['c1', 'c4', 'c1', 'c1', 'c1']})

onehot = OneHotEncoder()
onehot.fit(df)

onehot.get_feature_names()
array(['x0_c1', 'x0_c2', 'x0_c3', 'x1_c1', 'x1_c4'], dtype=object)

Given that the encoder is fed a dataframe, I'd expect to be able to obtain something like:

array(['a_c1', 'a_c2', 'a_c3', 'b_c1', 'b_c4'], dtype=object)

Scott Boston · Accepted Answer · 2019-11-05T14:20:01.390

4

To include your feature names, pass the dataframe's columns to get_feature_names via input_features:

onehot.get_feature_names(input_features=df.columns)

Output:

array(['a_c1', 'a_c2', 'a_c3', 'b_c1', 'b_c4'], dtype=object)

Per docs:

get_feature_names(self, input_features=None)
Return feature names for output features.

Parameters:
input_features : list of string, length n_features, optional
    String names for input features if available. By default, "x0", "x1", ... "xn_features" is used.

Returns:
output_feature_names : array of string, length n_output_features

edited Nov 05 '19 at 14:20

answered Nov 05 '19 at 14:17



Ahh so there is a way! Thanks @scott :) – yatu Nov 05 '19 at 14:19
For this exact reason, I have updated the example in the documentation. See the dev version of the documentation [here](https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.OneHotEncoder.html) – Venkatachalam Nov 06 '19 at 11:29

score 0 · Answer 2 · edited Aug 04 '20 at 15:53

Let's create a dataframe with 3 columns, each having some categorical values.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df_dict = {'Sex': ['m', 'f', 'm', 'f'], 'City': ['C1', 'C2', 'C3', 'C4'], 'States': ['S1', 'S2', 'S3', 'S4']}
df = pd.DataFrame.from_dict(df_dict)
cat_enc = OneHotEncoder(handle_unknown='ignore')
transformed_array = cat_enc.fit_transform(df).toarray()
transformed_df = pd.DataFrame(transformed_array, columns=cat_enc.get_feature_names(df.columns))
transformed_df.head()

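This gives the following output. The encoded columns follow the order of the dataframe's columns, so on recent Python and pandas versions, which preserve dict insertion order, the Sex columns appear first: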

   City_C1  City_C2  City_C3  City_C4  Sex_f  Sex_m  States_S1  States_S2  States_S3  States_S4
0      1.0      0.0      0.0      0.0    0.0    1.0        1.0        0.0        0.0        0.0
1      0.0      1.0      0.0      0.0    1.0    0.0        0.0        1.0        0.0        0.0
2      0.0      0.0      1.0      0.0    0.0    1.0        0.0        0.0        1.0        0.0
3      0.0      0.0      0.0      1.0    1.0    0.0        0.0        0.0        0.0        1.0
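
The same approach can be written so that it runs on both old and new scikit-learn versions. This is only a sketch: the hasattr check is my own addition, based on get_feature_names_out having replaced get_feature_names from scikit-learn 1.0 onwards, and passing index=df.index is an optional extra to keep the original row labels.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Sex': ['m', 'f', 'm', 'f'],
                   'City': ['C1', 'C2', 'C3', 'C4'],
                   'States': ['S1', 'S2', 'S3', 'S4']})

cat_enc = OneHotEncoder(handle_unknown='ignore')
encoded = cat_enc.fit_transform(df)  # sparse matrix by default

# Prefer get_feature_names_out (scikit-learn >= 1.0); fall back to the
# older get_feature_names on versions that do not have it.
if hasattr(cat_enc, 'get_feature_names_out'):
    columns = cat_enc.get_feature_names_out(df.columns)
else:
    columns = cat_enc.get_feature_names(df.columns)

# Densify the sparse result and attach the prefixed column names.
transformed_df = pd.DataFrame(encoded.toarray(), columns=columns, index=df.index)
print(transformed_df.head())
```

Each row of transformed_df should contain exactly one 1.0 per original column, i.e. three per row here.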
