Recursive feature selection may not yield higher performance?

Question

I'm tring to analyze below data, modeled it with logistic regression first and then did the prediction, calculated the accuracy & auc; then performed recursive feature selection and calculated accuracy & auc again, thought the accuracy and auc would be higher, but actually they are both lower after the recursive feature selection, not sure whether it's expected? Or did I miss something?

Data: https://github.com/amandawang-dev/census-training/blob/master/census-training.csv

---------------------- for logistic regression, Accuracy: 0.8111649491571692; AUC: 0.824896256487386

after recursive feature selection, Accuracy: 0.8130075752405651; AUC: 0.7997315631730443

import pandas as pd
import numpy as np
from sklearn import preprocessing, metrics
from sklearn.model_selection import train_test_split


train=pd.read_csv('census-training.csv')
train = train.replace('?', np.nan)
for column in train.columns:
    train[column].fillna(train[column].mode()[0], inplace=True)
x['Income'] = x['Income'].str.contains('>50K').astype(int)
x['Gender'] = x['Gender'].str.contains('Male').astype(int)

obj = train.select_dtypes(include=['object']) #all features that are 'object' datatypes
le = preprocessing.LabelEncoder()
for i in range(len(obj.columns)):
    train[obj.columns[i]] = le.fit_transform(train[obj.columns[i]])#TODO  #Encode input data

train_set, test_set = train_test_split(train, test_size=0.3, random_state=42)

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score
from sklearn.metrics import accuracy_score


log_rgr = LogisticRegression(random_state=0)


X_train=train_set.iloc[:, 0:9]
y_train=train_set.iloc[:, 9:10]


X_test=test_set.iloc[:, 0:9]
y_test=test_set.iloc[:, 9:10]

log_rgr.fit(X_train, y_train)

y_pred = log_rgr.predict(X_test)

lr_acc = accuracy_score(y_test, y_pred)

probs = log_rgr.predict_proba(X_test)
preds = probs[:,1]
print(preds)
from sklearn.preprocessing import label_binarize
y = label_binarize(y_test, classes=[0, 1]) #note to myself: class need to have only 0,1
fpr, tpr, threshold = metrics.roc_curve(y, preds)

roc_auc = roc_auc_score(y_test, preds)

print("Accuracy: {}".format(lr_acc))
print("AUC: {}".format(roc_auc))

from sklearn.feature_selection import RFE


rfe = RFE(log_rgr, 5)
fit = rfe.fit(X_train, y_train)

X_train_new = fit.transform(X_train)
X_test_new = fit.transform(X_test)

log_rgr.fit(X_train_new, y_train)
y_pred = log_rgr.predict(X_test_new)

lr_acc = accuracy_score(y_test, y_pred)

probs = rfe.predict_proba(X_test)
preds = probs[:,1]
y = label_binarize(y_test, classes=[0, 1]) 

fpr, tpr, threshold = metrics.roc_curve(y, preds)
roc_auc =roc_auc_score(y_test, preds)

print("Accuracy: {}".format(lr_acc))
print("AUC: {}".format(roc_auc))

Also note that you shouldn't be using a LabelEncoder to encode categorical features. See [LabelEncoder for categorical features?](https://stackoverflow.com/questions/61217713/labelencoder-for-ordinal-features/61217936#61217936) — yatu, Apr 26 '20 at 14:45

score 4 · Answer 1 · answered Apr 26 '20 at 16:16

There is simply no guarantee that any kind of feature selection (backward, forward, recursive - you name it) will actually lead to better performance in general. None at all. Such tools are there for convenience only - they may work, or they may not. Best guide and ultimate judge is always the experiment.

Apart from some very specific cases in linear or logistic regression, most notably the Lasso (which, no coincidence, actually comes from statistics), or somewhat extreme cases with too many features (aka The curse of dimensionality), even when it works (or doesn't), there is not necessarily much to explain as to why (or why not).

Recursive feature selection may not yield higher performance?

1 Answers1