
I'm using GridSearchCV to perform feature selection (SelectKBest) for a linear regression. The results show that 10 features are selected (using .best_params_), but I'm unsure how to display which features these are.

The code is pasted below. I'm using a pipeline because the models that follow will also need hyperparameter selection. x_train is a DataFrame with 12 columns that I cannot share due to data restrictions.

cv_folds = KFold(n_splits=5, shuffle=False)
steps = [('feature_selection', SelectKBest(mutual_info_regression, k=3)),
         ('regr', LinearRegression())]
pipe = Pipeline(steps)

search_space = [{'feature_selection__k': [1,2,3,4,5,6,7,8,9,10,11,12]}]

clf = GridSearchCV(pipe, search_space, scoring='neg_mean_squared_error', cv=cv_folds, verbose=0)
clf = clf.fit(x_train, y_train)

print(clf.best_params_)
1 Answer


You can access the information about the feature_selection step like this:

<GridSearch_model_variable>.best_estimator_.named_steps[<feature_selection_step>]

So, in your case, it would be like this:

print(clf.best_estimator_.named_steps['feature_selection'])
#Output: SelectKBest(k=8, score_func=<function mutual_info_regression at 0x13d37b430>)

Next you can use the get_support method to get the boolean mask of the selected features:

print(clf.best_estimator_.named_steps['feature_selection'].get_support())
# Output: array([ True, False,  True, False,  True,  True,  True, False, False,
#                 True,  True, False,  True])
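If you prefer integer column positions over a boolean mask, get_support(indices=True) returns the selected indices directly. A minimal self-contained sketch, using synthetic data with hypothetical feat_* column names (since the original x_train cannot be shared):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Synthetic stand-in for x_train (hypothetical column names)
X, y = make_regression(n_samples=100, n_features=5, n_informative=2, random_state=0)
X = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(5)])

selector = SelectKBest(mutual_info_regression, k=2).fit(X, y)

mask = selector.get_support()              # boolean mask, one entry per column
idx = selector.get_support(indices=True)   # integer positions of the selected columns
print(mask)
print(idx)
print(X.columns[idx].tolist())             # the selected column names
```

Both forms describe the same selection; the index form is handy for fancy-indexing NumPy arrays that have no column labels.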

Now apply this mask to the original columns:

data_columns = X.columns # List of columns in your dataset

# This is the original list of columns
print(data_columns)
# Output: ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
#          'PTRATIO', 'B', 'LSTAT']

# Now print the selected columns
print(data_columns[clf.best_estimator_.named_steps['feature_selection'].get_support()])
# Output: ['CRIM', 'INDUS', 'NOX', 'RM', 'AGE', 'TAX', 'PTRATIO', 'LSTAT']

So you can see that out of 13 features only 8 were selected (in my data, k=8 was the best case).

Here is the full code with the Boston dataset:

import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import KFold
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

boston_dataset = load_boston()
X = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
y = boston_dataset.target

cv_folds = KFold(n_splits=5, shuffle=False)
steps = [('feature_selection', SelectKBest(mutual_info_regression, k=3)),
         ('regr', LinearRegression())]

pipe = Pipeline(steps)

search_space = [{'feature_selection__k': list(range(1, 14))}]  # Boston has 13 features

clf = GridSearchCV(pipe, search_space, scoring='neg_mean_squared_error', cv=cv_folds, verbose=0)
clf = clf.fit(X, y)

print(clf.best_params_)

data_columns = X.columns
selected_features = data_columns[clf.best_estimator_.named_steps['feature_selection'].get_support()]

print(selected_features)
# Output : Index(['CRIM', 'INDUS', 'NOX', 'RM', 'AGE', 'TAX', 'PTRATIO', 'LSTAT'], dtype='object')
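Note that load_boston was removed in scikit-learn 1.2, so the snippet above no longer runs on current versions. The same workflow on a synthetic dataset (hypothetical feat_* column names, standing in for your 12-column x_train) looks like this:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data: 12 features, of which 4 are informative
X, y = make_regression(n_samples=200, n_features=12, n_informative=4,
                       noise=10.0, random_state=0)
X = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(12)])

pipe = Pipeline([('feature_selection', SelectKBest(mutual_info_regression)),
                 ('regr', LinearRegression())])
search_space = [{'feature_selection__k': list(range(1, 13))}]

clf = GridSearchCV(pipe, search_space, scoring='neg_mean_squared_error', cv=5)
clf.fit(X, y)

print(clf.best_params_)
selected = X.columns[clf.best_estimator_.named_steps['feature_selection'].get_support()]
print(selected.tolist())
```

The number of names printed always matches the k reported in best_params_, since the mask is taken from the refitted best estimator.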

