I have a cleaned housing dataset with about 75 features and 1 target variable. To use lasso regression for selecting the most relevant of those 75 features, I have been relying on label encoding for the categorical features, since it preserves column identity, as follows:
# Label-encode all other categorical features, ordering labels by mean SalePrice (the target variable):
for x in categorical_features:
    labels_ordered = house_df.groupby([x])['SalePrice'].mean().sort_values().index
    labels_ordered = {k: i for i, k in enumerate(labels_ordered, 0)}
    house_df[x] = house_df[x].map(labels_ordered)
# After splitting into train/test, fitting the lasso through SelectFromModel
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

feature_sel_model = SelectFromModel(Lasso(alpha=0.005, random_state=0))
feature_sel_model.fit(X_train, y_train)
# Checking the array of selected and rejected features
feature_sel_model.get_support()
O/P: array([ True, True, False, False, False, False, False, False, False,
False, True, False, False, False, False, True, True, False,
True, False, False, False, False, False, False, False, False,
True, True, False, True, False, True, False, False, False,
True, False, True, True, False, True, False, False, True,
False, False, False, False, False, False, True, False, False,
True, False, False, False, True, True, True, False, False,
True, False, False, False, False, False, False, False, False,
False, False, True])
# Making a list of the selected features
selected_feat = X_train.columns[(feature_sel_model.get_support())]
# let's print some stats
print('total features: {}'.format((X_train.shape[1])))
print('selected features: {}'.format(len(selected_feat)))
O/P: total features: 75
selected features: 22
Column identity is needed so that the output of the lasso regression can be used to remove the irrelevant features from the original dataset.
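To make that concrete, this is all I do with the selected names afterwards (a minimal sketch using the variables defined above; X_test comes from the same train/test split):

# Keep only the lasso-selected columns; drop everything else by name
X_train_selected = X_train[selected_feat]
X_test_selected = X_test[selected_feat]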
My problem is that the categorical features have multiple labels and are not ordinal, so one-hot encoding with sklearn would really be the more appropriate encoding, but it produces a wide matrix and destroys column identity. How do I use the output of OneHotEncoder (a np.array with all the encoded variables brought to the left of the matrix) to feed the lasso regressor? Or should I stick with label encoding?
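For reference, this is roughly the one-hot pipeline I have in mind, applied to the un-encoded train split instead of the label encoding above (just a sketch; the ColumnTransformer setup and the X_train_raw name are my own placeholders):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# ColumnTransformer puts the one-hot columns first ("to the left"),
# followed by the passed-through numeric columns, and returns a matrix
# with no column names attached
ohe_transformer = ColumnTransformer(
    [('ohe', OneHotEncoder(handle_unknown='ignore'), categorical_features)],
    remainder='passthrough'
)
X_train_ohe = ohe_transformer.fit_transform(X_train_raw)  # column identity is lost here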