
I used `Pipeline` and `GridSearchCV` to select the best parameters, then used those parameters to fit a final pipeline (`best_pipe`). However, since the feature selection step (`SelectKBest`) sits inside the pipeline, I never fit `SelectKBest` on its own, so I can't see which features it kept.

I need to know the names of the k selected features. Any ideas on how to retrieve them? Thank you in advance.

from sklearn import (cross_validation, feature_selection, pipeline,
                     preprocessing, linear_model, grid_search)

# X is a DataFrame of features; y (also called target) holds the class labels
folds = 5
split = cross_validation.StratifiedKFold(target, n_folds=folds, shuffle=False,
                                         random_state=0)

scores = []
for k, (train, test) in enumerate(split):

    X_train, X_test, y_train, y_test = X.ix[train], X.ix[test], y.ix[train], y.ix[test]

    top_feat = feature_selection.SelectKBest()

    pipe = pipeline.Pipeline([('scaler', preprocessing.StandardScaler()),
                              ('feat', top_feat),
                              ('clf', linear_model.LogisticRegression())])

    # Grid over the number of features to keep and the classifier settings
    K = [40, 60, 80, 100]
    C = [1.0, 0.1, 0.01, 0.001, 0.0001, 0.00001]
    penalty = ['l1', 'l2']

    param_grid = [{'feat__k': K,
                   'clf__C': C,
                   'clf__penalty': penalty}]

    scoring = 'precision'

    gs = grid_search.GridSearchCV(estimator=pipe, param_grid=param_grid,
                                  scoring=scoring)
    gs.fit(X_train, y_train)

    best_score = gs.best_score_
    scores.append(best_score)

    print "Fold: {} {} {:.4f}".format(k + 1, scoring, best_score)
    print gs.best_params_

# Refit a fresh pipeline with the winning parameters
best_pipe = pipeline.Pipeline([('scale', preprocessing.StandardScaler()),
                               ('feat', feature_selection.SelectKBest(k=80)),
                               ('clf', linear_model.LogisticRegression(C=0.0001, penalty='l2'))])

best_pipe.fit(X_train, y_train)
best_pipe.predict(X_test)
figgy

3 Answers


You can access the feature selector by name in best_pipe:

features = best_pipe.named_steps['feat']

Then you can call transform() on an array of column indices; the indices that survive the transform are the selected columns, so you can use them to index into the column names:

X.columns[features.transform(np.arange(len(X.columns)))]

The output here will be the eighty column names selected in the pipeline.
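For anyone who wants to see the trick end to end, here is a minimal, self-contained sketch on made-up data (the DataFrame, its column names, and k are all placeholders, not from the question; note that recent scikit-learn releases require a 2-D input to transform(), hence the wrapping list):

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical stand-ins for the question's X and y
rng = np.random.RandomState(0)
X_demo = pd.DataFrame(rng.rand(50, 5), columns=['a', 'b', 'c', 'd', 'e'])
y_demo = rng.randint(0, 2, 50)

selector = SelectKBest(f_classif, k=2).fit(X_demo, y_demo)

# Push the column indices 0..n-1 through transform(); whichever
# indices come out the other side are the selected columns.
idx = selector.transform([np.arange(X_demo.shape[1])])[0]
print(X_demo.columns[idx])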

jakevdp
  • A real treat to receive the solution from you Jake, you actually helped me learn python with your pycon tutorial videos. However, I get the error "could not convert string to float: score_575-600" (score_575-600 is the name of one of the columns) how can this be resolved? – figgy Oct 28 '15 at 12:40
  • Ah – I forgot that the feature selector doesn't work on strings. Try the updated version above. Glad to hear the videos were helpful! – jakevdp Oct 28 '15 at 12:55
  • Still not sure how to avoid the error above, but this double-step solution at least got me the column names for the k best features: `features = best_pipe.named_steps['feat'].get_support(); x_cols = X.columns.values[features == True]` – figgy Oct 28 '15 at 13:36
  • Great, the updated version works! Although it's not exactly clear how or why... I posted my comment before refreshing, so I did not see the updated version earlier. – figgy Oct 28 '15 at 14:37

Jake's answer totally works. But depending on which feature selector you're using, there's another option that I think is cleaner. This one worked for me:

X.columns[features.get_support()]

It gave me a result identical to Jake's answer. You can read more about it in the docs, but get_support returns a boolean array with one entry per column, True where the column was kept. Also note that X must have the same columns as the training data used to fit the feature selector.
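As a quick illustration, here is a self-contained sketch on invented data (the column names and k are placeholders, not from the question):

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Invented stand-ins for the question's X and y
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(40, 4), columns=['age', 'height', 'weight', 'score'])
y = rng.randint(0, 2, 40)

features = SelectKBest(f_classif, k=2).fit(X, y)

mask = features.get_support()  # boolean array, one entry per column
print(X.columns[mask])         # names of the two columns that were kept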

bwest87
  • Definitely prefer this answer, `features.transform(np.arange(len(X.columns)))` is basically longhand for `features.get_support()`. – andrew Sep 06 '17 at 15:48

This could be an instructive alternative: I ran into a similar need to what the OP asked. If you want to get the indices of the k best features directly from GridSearchCV:

finalFeatureIndices = gs.best_estimator_.named_steps["feat"].get_support(indices=True)

And via index manipulation, you can get your finalFeatureList:

finalFeatureList = [initialFeatureList[i] for i in finalFeatureIndices]
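To make this concrete, here is a minimal runnable sketch under stated assumptions: it uses the modern sklearn.model_selection import (the question's grid_search module is its predecessor), and the data, column names, and parameter grid are all made up:

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data; initialFeatureList stands in for the real column names
rng = np.random.RandomState(0)
initialFeatureList = ['f0', 'f1', 'f2', 'f3', 'f4', 'f5']
X = pd.DataFrame(rng.rand(60, 6), columns=initialFeatureList)
y = rng.randint(0, 2, 60)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('feat', SelectKBest(f_classif)),
                 ('clf', LogisticRegression())])

gs = GridSearchCV(pipe, param_grid={'feat__k': [2, 3]})
gs.fit(X, y)  # refit=True by default, so best_estimator_ is available

# Indices of the columns kept by the best pipeline, then their names
finalFeatureIndices = gs.best_estimator_.named_steps['feat'].get_support(indices=True)
finalFeatureList = [initialFeatureList[i] for i in finalFeatureIndices]
print(finalFeatureList)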
ximiki