Names for Feature Selection

Question

I want to get the names of the features used by my RF model. I read here that the output of gs.best_estimator_.named_steps["stepname"].feature_importances_ would mirror the columns of my data. However, that array has length 10 and my data has 13 columns, so I assume some columns were dropped as unimportant. Other answers (answer1, answer2) suggest I have to declare something in my pipeline, but I am not sure what to declare, because both answers deal with PCA, not RF.

Here is what I have so far.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn import datasets

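# use iris as example (as_frame=True returns the data as a pandas DataFrame)
iris = datasets.load_iris(as_frame=True).frame
iris.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
X = iris.drop(['sepal_length'],axis=1)
y = iris.sepal_length
cat_feats = ['species']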
X_train, X_test, y_train, y_test = \
        train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=13)
# Pipeline
categorical_transformer = Pipeline(steps=[
                # on scikit-learn >= 1.2 this argument is named sparse_output
                ('onehot', OneHotEncoder(handle_unknown='ignore',sparse=False))
                                    ])
# Bundle any preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, cat_feats)
    ])
rf = RandomForestRegressor(random_state = 13)
mymodel = Pipeline(steps = [('preprocessor', preprocessor),
                            ('model', rf)
                            ])
# For this example, I used default values (an empty param_grid). In reality I do use a dictionary of parameters
gs = GridSearchCV(mymodel
                           ,param_grid = {}
                           ,n_jobs = -1
                           ,cv = 5
                           )
gs.fit(X_train,y_train)

score 1 · Accepted Answer · answered Apr 14 '20 at 21:43

Why the length of the feature list does not match

The lengths do not match because your ColumnTransformer discards all non-categorical columns. By default (remainder='drop'), it keeps only the columns for which a transformer was specified. To keep the remaining columns as well, pass remainder='passthrough':

preprocessor = ColumnTransformer(transformers=[('cat', OneHotEncoder(), cat_feats)],
                                 remainder='passthrough')

(I removed your categorical pipeline, which is not necessary here)

Also keep in mind that one-hot encoding (OHE) adds features: each categorical column becomes one column per category, so the total number of features will usually be larger than the number of columns you started with.
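
To see both effects on a small scale, here is a minimal sketch. The toy dataframe and its values are made up for illustration, and sparse_threshold=0 is only there to force a dense array so the shapes are easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical and two numerical columns.
df = pd.DataFrame({
    "species": ["setosa", "virginica", "setosa", "versicolor"],
    "sepal_width": [3.5, 3.0, 3.2, 2.8],
    "petal_length": [1.4, 5.1, 1.3, 4.5],
})
cat_feats = ["species"]

# Default remainder='drop': only the one-hot columns survive.
dropped = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(), cat_feats)],
    sparse_threshold=0,  # always return a dense array
).fit_transform(df)

# remainder='passthrough': the untouched numerical columns are appended.
kept = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(), cat_feats)],
    remainder="passthrough",
    sparse_threshold=0,
).fit_transform(df)

print(dropped.shape)  # (4, 3): three species categories
print(kept.shape)     # (4, 5): three one-hot columns + two numerical columns
```

The 3 input columns become 5 features with passthrough, and only 3 without it, which is the same kind of mismatch as in the question.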

How to get the feature names

Once you have fitted everything, you need to retrieve the feature names for the result of the OHE and the remaining numerical columns.

For the OHE columns (on scikit-learn 1.0 and later the method is called get_feature_names_out()):

cat_features = gs.best_estimator_["preprocessor"].named_transformers_["cat"].get_feature_names()

For the numerical columns, you need to declare num_feats, a list of all numerical features in the same order as in your original dataframe.

Then concatenate the two (with numpy imported as np):

feature_names = np.concatenate((cat_features, num_feats))

PS: this is a bit cumbersome, and this might be improved in later sklearn versions, but as of now, this is the procedure

So like this: `num_feats = ['sepal_width', 'petal_length', 'petal_width']` — Jack Armstrong, Apr 15 '20 at 13:56
Also, what if the columns in my dataframe are ordered like cat1, num1, cat2, cat3, num2, num3, etc.? That is, categorical variable 1, numerical variable 1, categorical variable 2, and so on. Would your method still work, or would you recommend reorganizing the dataframe first? — Jack Armstrong, Apr 15 '20 at 14:23
For sake of completeness, I did re-organize my dataframe and there were no differences. It seems like as long as the categorical variables are first, then the numerical ones in the order of the dataframe, the user should be good. — Jack Armstrong, Apr 15 '20 at 14:51
It should not matter. What matters when creating the `feature_names` array is that you put the categorical features first, as computed above, and then all numerical features in the same order as they appear from left to right in the dataframe. — MaximeKan, Apr 15 '20 at 16:21
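
The ordering described in these comments can be checked with a small sketch. The column names cat1, num1, cat2, num2 and the values are hypothetical, and sparse_threshold=0 is only used to get a dense array for the comparison.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical frame with categorical and numerical columns interleaved.
df = pd.DataFrame({
    "cat1": ["x", "y", "x"],
    "num1": [1.0, 2.0, 3.0],
    "cat2": ["p", "p", "q"],
    "num2": [10.0, 20.0, 30.0],
})
cat_feats = ["cat1", "cat2"]
# Numerical columns, kept in their left-to-right dataframe order.
num_feats = [c for c in df.columns if c not in cat_feats]

ct = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(), cat_feats)],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
Xt = ct.fit_transform(df)

# Number of one-hot columns = total number of categories seen during fit.
n_cat = sum(len(c) for c in ct.named_transformers_["cat"].categories_)

# The trailing columns should be num1 and num2, unchanged and in order.
same_order = np.allclose(Xt[:, n_cat:], df[num_feats].to_numpy())
print(n_cat, Xt.shape, same_order)
```

The one-hot block comes first and the passthrough columns follow in dataframe order, so the dataframe itself does not need to be reorganized.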
