
I want to get the names of the most important features for a logistic regression model after the transformation.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, Normalizer

columns_for_encoding = ['a', 'b', 'c', 'd', 'e', 'f',
                        'g', 'h', 'i', 'j', 'k', 'l', 'm',
                        'n', 'o', 'p']

columns_for_scaling = ['aa', 'bb', 'cc', 'dd', 'ee']

# one-hot encode the categorical columns, normalize the numeric ones,
# and pass all remaining columns through untouched
transformerVectoriser = ColumnTransformer(
    transformers=[('Vector Cat', OneHotEncoder(handle_unknown="ignore"), columns_for_encoding),
                  ('Normalizer', Normalizer(), columns_for_scaling)],
    remainder='passthrough')

I know that I can do this:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot

x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.2, random_state=42)
x_train = transformerVectoriser.fit_transform(x_train)
x_test = transformerVectoriser.transform(x_test)

clf = LogisticRegression(max_iter=5000, class_weight={1: 3.5, 0: 1})
model = clf.fit(x_train, y_train)

# one coefficient per transformed feature
importance = model.coef_[0]

# summarize feature importance
for i, v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i, v))

# plot feature importance
pyplot.bar(range(len(importance)), importance)
pyplot.show()

But with this I only get feature 1, feature 2, feature 3, etc., and after the transformation I have around 45k features.

How can I get the list of the most important features in terms of the original (pre-transformation) columns? I want to know which features are best for the model. I have a lot of categorical features with 100+ distinct categories each, so after encoding I have more features than rows in my dataset. I want to find out which features I can exclude from my dataset and which features matter most for my model.

IMPORTANT: I have other features that are used but not transformed; because of that I set remainder='passthrough'.
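
Edit: for clarity, I can get the expanded (post-transformation) names from the fitted ColumnTransformer if the scikit-learn version provides get_feature_names_out (a minimal sketch, assuming scikit-learn >= 1.0), but these are still the encoded dummy columns, not the original features I want to rank:

# Assumes scikit-learn >= 1.0 and that transformerVectoriser has already been fitted.
# Names come back as e.g. 'Vector Cat__a_<category>' or 'remainder__<column>'.
encoded_names = transformerVectoriser.get_feature_names_out()
for name, coef in zip(encoded_names, model.coef_[0]):
    print(f'{name}: {coef:.5f}')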


1 Answer


As you may already be aware, the whole idea of feature importance is a bit tricky in the case of LogisticRegression. You can read more about it in these posts (a small illustration follows the list):

  1. How to find the importance of the features for a logistic regression model?
  2. Feature Importance in Logistic Regression for Machine Learning Interpretability
  3. How to Calculate Feature Importance With Python
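
Just to make the point concrete with a toy example on synthetic data (a sketch only, unrelated to the penguin data further below): rescaling a feature changes its logistic regression coefficient even though its predictive content is identical, which is why raw coefficients cannot be read directly as importances.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 1))
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

clf = LogisticRegression(C=1e6)  # large C ~ effectively unregularized
coef_raw = clf.fit(X_toy, y_toy).coef_[0, 0]
coef_scaled = clf.fit(X_toy * 100, y_toy).coef_[0, 0]  # same feature, different units
print(coef_raw, coef_scaled)  # the second coefficient is roughly 100x smaller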

I personally found these and other similar posts inconclusive, so I am going to avoid that part in my answer and instead address your main question about feature splitting and about aggregating the feature importances (assuming they are available for the split features), using a RandomForestClassifier. I am also assuming that the importance of a parent feature is the sum of the importances of its child (encoded) features.

Under these assumptions, we can use the code below to obtain the importances of the original features. I am using the Palmer Archipelago (Antarctica) penguin data for the illustration.

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, Normalizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv('./data/penguins_size.csv')
df = df.dropna()
# to comply with the later assumption that column names don't contain '_'
df.columns = [c.replace('_', '-') for c in df.columns]

X = df.iloc[:, :-1]                                   # all feature columns (everything except 'sex')
y = np.asarray(df.iloc[:, 6] == 'MALE').astype(int)   # binary target: 1 if the penguin is male

pd.options.display.width = 0
print(X.head())
species island culmen-length-mm culmen-depth-mm flipper-length-mm body-mass-g
Adelie Torgersen 39.1 18.7 181.0 3750.0
Adelie Torgersen 39.5 17.4 186.0 3800.0
Adelie Torgersen 40.3 18.0 195.0 3250.0
Adelie Torgersen 36.7 19.3 193.0 3450.0
Adelie Torgersen 39.3 20.6 190.0 3650.0
columns_for_encoding = ['species', 'island']
columns_for_scaling = ['culmen-length-mm', 'culmen-depth-mm']

transformerVectoriser = ColumnTransformer(
    transformers=[('Vector Cat', OneHotEncoder(handle_unknown="ignore"), columns_for_encoding),
                  ('Normalizer', Normalizer(), columns_for_scaling)],
    remainder='passthrough')

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
x_train = transformerVectoriser.fit_transform(x_train)
x_test = transformerVectoriser.transform(x_test)

clf = RandomForestClassifier(max_depth=5)
model = clf.fit(x_train, y_train)

importance = model.feature_importances_

# feature names derived from the encoded columns and their individual importances
# encoded cols
enc_col_out = transformerVectoriser.named_transformers_['Vector Cat'].get_feature_names_out()
enc_col_out_imp = importance[transformerVectoriser.output_indices_['Vector Cat']]
# normalized cols
norm_col = transformerVectoriser.named_transformers_['Normalizer'].feature_names_in_
norm_col_imp = importance[transformerVectoriser.output_indices_['Normalizer']]
# remainder cols, require a quick lookup as no transformer object exists for this case
rem_cols = []
for (tname, _, cs) in transformerVectoriser.transformers_:
    if tname == 'remainder':
        rem_cols = X.columns[cs]
        break
rem_col_imp = importance[transformerVectoriser.output_indices_['remainder']]

# storing them in a dataframe for easy manipulation
imp_df = pd.DataFrame({'feature': list(enc_col_out) + list(norm_col) + list(rem_cols),
                       'importance': list(enc_col_out_imp) + list(norm_col_imp) + list(rem_col_imp)})

# aggregating, assuming that column names don't contain _ just to keep it simple
imp_df['feature'] = imp_df['feature'].apply(lambda x: x.split('_')[0])
imp_agg = imp_df.groupby(by=['feature']).sum()
print(imp_agg)
print(f'Sum of feature importances: {imp_df["importance"].sum()}')
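
As a side note (an assumption about your scikit-learn version, not something the code above relies on): with scikit-learn >= 1.0 the fitted ColumnTransformer also exposes get_feature_names_out(), which returns the output names for the encoder, the normalizer and the passthrough columns in one call, so the per-transformer lookups above can be collapsed into a few lines:

# Sketch assuming scikit-learn >= 1.0; names look like 'Vector Cat__species_Adelie',
# 'Normalizer__culmen-length-mm' or 'remainder__flipper-length-mm'.
out_names = transformerVectoriser.get_feature_names_out()
alt_df = pd.DataFrame({'feature': out_names, 'importance': importance})
# strip the '<transformer>__' prefix, then keep the part before the first '_'
# (same trick as above, relying on the original column names containing no '_')
alt_df['feature'] = alt_df['feature'].str.split('__').str[1].str.split('_').str[0]
print(alt_df.groupby('feature').sum())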

Output:

[screenshot of the aggregated feature-importance table printed above]

  • Ok, but what about all other features? You have only island and species in your output, what about all other features? – taga Nov 15 '21 at 19:01
  • I ignored them as they have one-to-one mapping to the derivatives, but, I see that it will be useful to include them. Now we have them too. – jdsurya Nov 15 '21 at 20:04
  • Thanks! One question, what If I get importance score less than 0 (negative value) ? – taga Nov 15 '21 at 21:44
  • The scores you get are regression coefficients; while they are indicative of importance, they are not importances themselves. You can attempt to calculate importances using some reliable method (one you can try is link #2 I shared, but I cannot recommend it). Once you have actual importance numbers, they will never be negative. As something more primitive, you can also ignore the sign of the coefficient and use its absolute value as the importance (the bigger the coefficient on either side, the more important the feature). In that case, make sure to at least take care of feature variances. – jdsurya Nov 15 '21 at 22:09
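
To make the last comment concrete, a minimal sketch (my assumptions: the absolute value of each LogisticRegression coefficient is treated as a rough importance, the name lists and penguin split built in the answer are reused, and in a real setting the numeric columns would be standardized, e.g. with StandardScaler instead of Normalizer):

# Rough sketch: |coefficient| as importance, aggregated back to the original columns.
clf = LogisticRegression(max_iter=5000)
model = clf.fit(x_train, y_train)
coef_abs = np.abs(model.coef_[0])

abs_df = pd.DataFrame({
    'feature': list(enc_col_out) + list(norm_col) + list(rem_cols),
    'importance': list(coef_abs[transformerVectoriser.output_indices_['Vector Cat']])
                  + list(coef_abs[transformerVectoriser.output_indices_['Normalizer']])
                  + list(coef_abs[transformerVectoriser.output_indices_['remainder']])
})
abs_df['feature'] = abs_df['feature'].apply(lambda x: x.split('_')[0])
print(abs_df.groupby('feature').sum().sort_values('importance', ascending=False))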