I have a cleaned housing dataset with about 75 features and 1 target variable. To use lasso regression for selecting the most relevant of those 75 features, I have been relying on label encoding for the categorical features, since it preserves column identity, as follows:
# Label-encode all other categorical features, ordering labels by mean SalePrice (the target variable):
for x in categorical_features:
    labels_ordered = house_df.groupby([x])['SalePrice'].mean().sort_values().index
    labels_ordered = {k: i for i, k in enumerate(labels_ordered, 0)}
    house_df[x] = house_df[x].map(labels_ordered)
# After splitting into train/test, fitting the lasso through SelectFromModel
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

feature_sel_model = SelectFromModel(Lasso(alpha=0.005, random_state=0))
feature_sel_model.fit(X_train, y_train)
# Checking the array of selected and rejected features
feature_sel_model.get_support()
O/P: array([ True, True, False, False, False, False, False, False, False,
False, True, False, False, False, False, True, True, False,
True, False, False, False, False, False, False, False, False,
True, True, False, True, False, True, False, False, False,
True, False, True, True, False, True, False, False, True,
False, False, False, False, False, False, True, False, False,
True, False, False, False, True, True, True, False, False,
True, False, False, False, False, False, False, False, False,
False, False, True])
# Making a list of the selected features
selected_feat = X_train.columns[(feature_sel_model.get_support())]
# let's print some stats
print('total features: {}'.format((X_train.shape[1])))
print('selected features: {}'.format(len(selected_feat)))
O/P: total features: 75
selected features: 22
Column identity is needed so that the output of the lasso regression can be used to remove the irrelevant features from the original dataset.
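To make that concrete, this is all I do with the selected names afterwards (a minimal sketch using the variables defined above; X_test comes from the same train/test split):

# Keep only the lasso-selected columns; drop everything else by name
X_train_selected = X_train[selected_feat]
X_test_selected = X_test[selected_feat]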
My problem is that the categorical features have multiple labels and are not ordinal, so one-hot encoding with sklearn would really be the more appropriate encoding, but it produces a wide matrix and destroys column identity. How do I use the output of OneHotEncoder (a np.array with all the encoded variables brought to the left of the matrix) to feed the lasso regressor? Or should I stick with label encoding?
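For reference, this is roughly the one-hot pipeline I have in mind, applied to the un-encoded train split instead of the label encoding above (just a sketch; the ColumnTransformer setup and the X_train_raw name are my own placeholders):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# ColumnTransformer puts the one-hot columns first ("to the left"),
# followed by the passed-through numeric columns, and returns a matrix
# with no column names attached
ohe_transformer = ColumnTransformer(
    [('ohe', OneHotEncoder(handle_unknown='ignore'), categorical_features)],
    remainder='passthrough'
)
X_train_ohe = ohe_transformer.fit_transform(X_train_raw)  # column identity is lost here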