
I want to get the names of the most important features for a logistic regression model after the transformation.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, Normalizer

columns_for_encoding = ['a', 'b', 'c', 'd', 'e', 'f',
                        'g', 'h', 'i', 'j', 'k', 'l', 'm',
                        'n', 'o', 'p']

columns_for_scaling = ['aa', 'bb', 'cc', 'dd', 'ee']

# one-hot encode the categorical columns, normalize the numeric ones,
# and pass all remaining columns through untouched
transformerVectoriser = ColumnTransformer(
    transformers=[('Vector Cat', OneHotEncoder(handle_unknown="ignore"), columns_for_encoding),
                  ('Normalizer', Normalizer(), columns_for_scaling)],
    remainder='passthrough')

I know that I can do this:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot

x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.2, random_state=42)
x_train = transformerVectoriser.fit_transform(x_train)
x_test = transformerVectoriser.transform(x_test)

clf = LogisticRegression(max_iter=5000, class_weight={1: 3.5, 0: 1})
model = clf.fit(x_train, y_train)

# one coefficient per transformed feature
importance = model.coef_[0]

# summarize feature importance
for i, v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i, v))

# plot feature importance
pyplot.bar(range(len(importance)), importance)
pyplot.show()

But with this I only get feature 1, feature 2, feature 3, etc., and after the transformation I have around 45k features.

How can I get the list of the most important features in terms of the original (pre-transformation) columns? I want to know which features are best for the model. I have a lot of categorical features with 100+ distinct categories each, so after encoding I have more features than rows in my dataset. I want to find out which features I can exclude from my dataset and which features matter most for my model.

IMPORTANT: I have other features that are used but not transformed; because of that I set remainder='passthrough'.
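
Edit: for clarity, I can get the expanded (post-transformation) names from the fitted ColumnTransformer if the scikit-learn version provides get_feature_names_out (a minimal sketch, assuming scikit-learn >= 1.0), but these are still the encoded dummy columns, not the original features I want to rank:

# Assumes scikit-learn >= 1.0 and that transformerVectoriser has already been fitted.
# Names come back as e.g. 'Vector Cat__a_<category>' or 'remainder__<column>'.
encoded_names = transformerVectoriser.get_feature_names_out()
for name, coef in zip(encoded_names, model.coef_[0]):
    print(f'{name}: {coef:.5f}')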


1 Answer


As you may already be aware, the whole idea of feature importance is a bit tricky in the case of LogisticRegression. You can read more about it in these posts (a small illustration follows the list):

  1. How to find the importance of the features for a logistic regression model?
  2. Feature Importance in Logistic Regression for Machine Learning Interpretability
  3. How to Calculate Feature Importance With Python
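
Just to make the point concrete with a toy example on synthetic data (a sketch only, unrelated to the penguin data further below): rescaling a feature changes its logistic regression coefficient even though its predictive content is identical, which is why raw coefficients cannot be read directly as importances.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 1))
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

clf = LogisticRegression(C=1e6)  # large C ~ effectively unregularized
coef_raw = clf.fit(X_toy, y_toy).coef_[0, 0]
coef_scaled = clf.fit(X_toy * 100, y_toy).coef_[0, 0]  # same feature, different units
print(coef_raw, coef_scaled)  # the second coefficient is roughly 100x smaller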

I personally found these and other similar posts inconclusive, so I am going to avoid that part in my answer and instead address your main question about feature splitting and about aggregating the feature importances (assuming they are available for the split features), using a RandomForestClassifier. I am also assuming that the importance of a parent feature is the sum of the importances of its child (encoded) features.

Under these assumptions, we can use the code below to obtain the importances of the original features. I am using the Palmer Archipelago (Antarctica) penguin data for the illustration.

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, Normalizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv('./data/penguins_size.csv')
df = df.dropna()
# to comply with the later assumption that column names don't contain '_'
df.columns = [c.replace('_', '-') for c in df.columns]

X = df.iloc[:, :-1]                                   # all feature columns (everything except 'sex')
y = np.asarray(df.iloc[:, 6] == 'MALE').astype(int)   # binary target: 1 if the penguin is male

pd.options.display.width = 0
print(X.head())
species island culmen-length-mm culmen-depth-mm flipper-length-mm body-mass-g
Adelie Torgersen 39.1 18.7 181.0 3750.0
Adelie Torgersen 39.5 17.4 186.0 3800.0
Adelie Torgersen 40.3 18.0 195.0 3250.0
Adelie Torgersen 36.7 19.3 193.0 3450.0
Adelie Torgersen 39.3 20.6 190.0 3650.0
columns_for_encoding = ['species', 'island']
columns_for_scaling = ['culmen-length-mm', 'culmen-depth-mm']

transformerVectoriser = ColumnTransformer(
    transformers=[('Vector Cat', OneHotEncoder(handle_unknown="ignore"), columns_for_encoding),
                  ('Normalizer', Normalizer(), columns_for_scaling)],
    remainder='passthrough')

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
x_train = transformerVectoriser.fit_transform(x_train)
x_test = transformerVectoriser.transform(x_test)

clf = RandomForestClassifier(max_depth=5)
model = clf.fit(x_train, y_train)

importance = model.feature_importances_

# feature names derived from the encoded columns and their individual importances
# encoded cols
enc_col_out = transformerVectoriser.named_transformers_['Vector Cat'].get_feature_names_out()
enc_col_out_imp = importance[transformerVectoriser.output_indices_['Vector Cat']]
# normalized cols
norm_col = transformerVectoriser.named_transformers_['Normalizer'].feature_names_in_
norm_col_imp = importance[transformerVectoriser.output_indices_['Normalizer']]
# remainder cols, require a quick lookup as no transformer object exists for this case
rem_cols = []
for (tname, _, cs) in transformerVectoriser.transformers_:
    if tname == 'remainder':
        rem_cols = X.columns[cs]
        break
rem_col_imp = importance[transformerVectoriser.output_indices_['remainder']]

# storing them in a dataframe for easy manipulation
imp_df = pd.DataFrame({'feature': list(enc_col_out) + list(norm_col) + list(rem_cols),
                       'importance': list(enc_col_out_imp) + list(norm_col_imp) + list(rem_col_imp)})

# aggregating, assuming that column names don't contain _ just to keep it simple
imp_df['feature'] = imp_df['feature'].apply(lambda x: x.split('_')[0])
imp_agg = imp_df.groupby(by=['feature']).sum()
print(imp_agg)
print(f'Sum of feature importances: {imp_df["importance"].sum()}')
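
As a side note (an assumption about your scikit-learn version, not something the code above relies on): with scikit-learn >= 1.0 the fitted ColumnTransformer also exposes get_feature_names_out(), which returns the output names for the encoder, the normalizer and the passthrough columns in one call, so the per-transformer lookups above can be collapsed into a few lines:

# Sketch assuming scikit-learn >= 1.0; names look like 'Vector Cat__species_Adelie',
# 'Normalizer__culmen-length-mm' or 'remainder__flipper-length-mm'.
out_names = transformerVectoriser.get_feature_names_out()
alt_df = pd.DataFrame({'feature': out_names, 'importance': importance})
# strip the '<transformer>__' prefix, then keep the part before the first '_'
# (same trick as above, relying on the original column names containing no '_')
alt_df['feature'] = alt_df['feature'].str.split('__').str[1].str.split('_').str[0]
print(alt_df.groupby('feature').sum())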

Output:

[screenshot of the aggregated feature-importance table printed above]

  • Ok, but what about all other features? You have only island and species in your output, what about all other features? – taga Nov 15 '21 at 19:01
  • I ignored them as they have one-to-one mapping to the derivatives, but, I see that it will be useful to include them. Now we have them too. – jdsurya Nov 15 '21 at 20:04
  • Thanks! One question, what If I get importance score less than 0 (negative value) ? – taga Nov 15 '21 at 21:44
  • The scores you get are regression coefficients; while they are indicative of importance, they are not importances themselves. You can attempt to calculate importances using some reliable method (one you can try is link #2 I shared, but I cannot recommend it). Once you have actual importance numbers, they will never be negative. As something more primitive, you can also ignore the sign of the coefficient and use its absolute value as the importance (the bigger the coefficient on either side, the more important the feature). In that case, make sure to at least take care of feature variances. – jdsurya Nov 15 '21 at 22:09
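
To make the last comment concrete, a minimal sketch (my assumptions: the absolute value of each LogisticRegression coefficient is treated as a rough importance, the name lists and penguin split built in the answer are reused, and in a real setting the numeric columns would be standardized, e.g. with StandardScaler instead of Normalizer):

# Rough sketch: |coefficient| as importance, aggregated back to the original columns.
clf = LogisticRegression(max_iter=5000)
model = clf.fit(x_train, y_train)
coef_abs = np.abs(model.coef_[0])

abs_df = pd.DataFrame({
    'feature': list(enc_col_out) + list(norm_col) + list(rem_cols),
    'importance': list(coef_abs[transformerVectoriser.output_indices_['Vector Cat']])
                  + list(coef_abs[transformerVectoriser.output_indices_['Normalizer']])
                  + list(coef_abs[transformerVectoriser.output_indices_['remainder']])
})
abs_df['feature'] = abs_df['feature'].apply(lambda x: x.split('_')[0])
print(abs_df.groupby('feature').sum().sort_values('importance', ascending=False))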