
I have a dataset with more than 100k rows and 1000 columns/features and one binary output (0 and 1). I want to select the best features/columns for my model. I was thinking of combining multiple feature selection methods from scikit-learn, but I do not know if this is the right procedure or whether I am doing it correctly. Also, you will see in the code below that when I use PCA it says that column f1 is the most important feature, but at the end it says that I should use column 2 (feature f2). Why is this happening? Is this good/correct/normal? Please see the code below; I have used dummy data for this:

import pandas as pd

from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC


df = pd.DataFrame({'f1':[1,5,3,4,5,16,3,1,0],
                   'f2':[0.1,0.5,0.3,0.4,0.5,1.6,0.3,0.1,1],
                   'f3':[12,41,53,13,53,13,65,24,21],
                   'f4':[1,6,3,4,4,18,5,2,5],
                   'f5':[10,15,32,41,51,168,27,13,2],
                   'result':[1,0,1,0,0,0,1,1,0]})

print(df)

x = df.iloc[:,:-1]
y = df.iloc[:,-1]

# Printing the shape of my data before PCA
print(x.shape)

# Doing PCA to reduce number of features
pca = PCA()
fit = pca.fit(x)

pca_result = list(fit.explained_variance_ratio_)
print(pca_result)

# I see that 'f1', 'f2' and 'f3' are the most important features,
# so now my x is:
x = df[['f1', 'f2', 'f3']]
print(x.shape) #new shape of x

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)

classifiers = [['Linear SVM', SVC(kernel = 'linear', gamma = 'scale')],
               ['Decision tree', DecisionTreeClassifier()],
               ['Random Forest', RandomForestClassifier(n_estimators = 100)]]


# Now I use 'SelectFromModel' so that I can get the optimal number of features/columns
my_acc = 0
for c in classifiers:

    clf = c[1].fit(x_train, y_train)

    model = SelectFromModel(clf, prefit=True)
    model_score = clf.score(x_test, y_test)
    column_res = model.transform(x_train).shape
    print(model_score, column_res)
    if model_score > my_acc:

        my_acc = model_score
        column_res = model.transform(x_train).shape
        number_of_columns = column_res[1]
        my_cls = c[0]

# The classifier with the best accuracy and its number of columns:
print(my_cls)
print('Number of columns',number_of_columns)


# Can I call 'RFE' now? Is it the correct / good / right thing to do?
# I want to find the best columns for this
my_acc = 0
for c in classifiers:

    model = c[1]
    rfe = RFE(model, n_features_to_select=number_of_columns)
    fit = rfe.fit(x_train, y_train)
    acc = fit.score(x_test, y_test)

    if acc > my_acc:
        my_acc = acc
        list_of_results = fit.support_

        final_model_name = c[0]
        final_model = c[1]

        print()

print(final_model_name)
print(my_acc)
print(list_of_results)

# I got a result saying that I should use the second column, while PCA says that the first column is the most important.
# Is this good / normal / correct?

Is this the right way, or am I doing something wrong?

taga
  • The problem with the above approach is that you are using the training data `x,y` to optimize the feature selection process. This will result in overfitting. The correct way is to use pipelines of transformers/classifiers and some type of training/validation splitting where all the training (PCA, feature selectors, classifiers etc) occurs on the training data and you evaluate your model on the validation data. `scikit-learn` provides various ways of doing it, like the `Pipeline` class, `train_test_split` function, `cross_val_score` function, `GridSearchCV` class etc (see the sketch after these comments) – Georgios Douzas Aug 20 '19 at 10:18
  • @GeorgiosDouzas I have updated the question, please check it, I have added `train_test_split`. Also, I do not know how I would place all this in a `pipeline` – taga Aug 20 '19 at 10:48
  • PCA maps the data onto a new feature space and then sorts the features according to their contribution to the explained variance. Since the new PCA features do not correspond to your original features, picking f1 based on the results from PCA is not right. – KRKirov Aug 20 '19 at 11:31
  • Please describe what exactly you are trying to do. Is it correct to assume you would like to build pipelines of (PCA, SelectFromModel, RFA, Classifier) with various hyper-parameters (for multiple classifiers) and then select the pipeline with the highest cross-validation score? – Georgios Douzas Aug 20 '19 at 11:36
  • @GeorgiosDouzas Yes, that's right – taga Aug 20 '19 at 11:58
  • @KRKirov Oh, so if I remove the part where I do `PCA` in the code above, the code will be correct? Also, is there a way to see PCA with unsorted values? – taga Aug 20 '19 at 12:00
  • @taga. This is not how it is done. If you use PCA, pick the new features with the highest explained variance, but don't try to interpret them as any of the old features. Just accept that they are new features derived from the original data. – KRKirov Aug 20 '19 at 12:26
  • @KRKirov Ok, I understand, but my goal is to find which features (their names) are the best for my model. Does that mean that `PCA` cannot help me? – taga Aug 20 '19 at 12:33
  • That's right - in general, the features produced by PCA do not correspond to any of the original features. However, you can still use the features from PCA in model fitting. – KRKirov Aug 20 '19 at 13:19
  • @KRKirov Thanks a lot! I think that I can use `test = SelectKBest()`, then `fit = test.fit(x, y)` and then `kbest_score = fit.scores_`. In that way I can select the features with the highest score – taga Aug 20 '19 at 13:25
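
A minimal sketch of the pipeline / cross-validation approach suggested in the comments above, reusing `y` and the five feature columns from the question's `df`. The selector, classifier and `cv=3` are only illustrative; the point is that `SelectFromModel` is refitted inside each training fold, so the held-out fold never influences the feature selection:

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

features = df[['f1', 'f2', 'f3', 'f4', 'f5']]

# Feature selection and classification chained together; each cross-validation
# split refits the whole pipeline on the training fold only.
pipe = Pipeline([
    ('select', SelectFromModel(RandomForestClassifier(n_estimators=100))),
    ('clf', SVC(kernel='linear', gamma='scale')),
])

scores = cross_val_score(pipe, features, y, cv=3)
print(scores.mean())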

1 Answer


To explain your code:

pca = PCA()
fit = pca.fit(x)

PCA keeps all of your features here, because of the default for `n_components` (from the docs): "Number of components to keep. If n_components is not set all components are kept."
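
For instance, with the five-column `x` that the question feeds into PCA (a quick, illustrative check):

# With the default, every component is kept, so n_components_ equals the
# number of original columns (5 here); passing n_components caps it.
print(PCA().fit(x).n_components_)                # 5
print(PCA(n_components=2).fit(x).n_components_)  # 2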

Regarding the command:

pca_result = list(fit.explained_variance_ratio_)

This post explains it quite well: Python scikit learn pca.explained_variance_ratio_ cutoff

You should use:

fit.explained_variance_ratio_.cumsum()

Its output is the cumulative share of variance (in %) that you keep as you add each component. Using PCA for feature importance of your original columns is wrong: the components are new, derived features, not the original ones.
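
A quick sketch of how you could use the cumulative ratio, reusing the fitted `fit` object from the question (the 0.95 threshold is just an example):

import numpy as np

# Cumulative share of variance explained by the first k components.
cumulative = fit.explained_variance_ratio_.cumsum()
print(cumulative)

# Smallest number of components that reaches 95% of the variance.
n_components_95 = int(np.argmax(cumulative >= 0.95) + 1)
print(n_components_95)

# Equivalently, PCA accepts a float: PCA(n_components=0.95) keeps just enough
# components to explain 95% of the variance.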

Only the part with SelectFromModel makes sense for feature selection. You could run SelectFromModel as a first step and afterwards use PCA for further dimensionality reduction, but if you have enough memory to run the model as is, there is no need to reduce the dimensionality at all.
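
A rough sketch of that order (SelectFromModel first, PCA afterwards), reusing the question's `x_train`/`x_test` split; the random forest selector and the 0.95 variance threshold are only placeholders:

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Step 1: drop weak original columns based on a tree ensemble's importances.
selector = SelectFromModel(RandomForestClassifier(n_estimators=100)).fit(x_train, y_train)
x_train_sel = selector.transform(x_train)
x_test_sel = selector.transform(x_test)

# Step 2: PCA on the remaining columns, keeping 95% of the variance.
pca = PCA(n_components=0.95)
x_train_red = pca.fit_transform(x_train_sel)
x_test_red = pca.transform(x_test_sel)
print(x_train_red.shape, x_test_red.shape)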

PV8
  • Can I use `test = SelectKBest()`, then `fit = test.fit(x, y)` and then `kbest_score = fit.scores_` instead of `PCA`? That way I can select the features with the highest score. Is there another way to get the contribution / importance of each feature? – taga Aug 20 '19 at 16:39
  • This example should help: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest – PV8 Aug 21 '19 at 05:50
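
For reference, a minimal `SelectKBest` sketch along those lines, scoring the original five feature columns from the question's `df` (the choice of `k=3` and `f_classif` is only an example):

from sklearn.feature_selection import SelectKBest, f_classif

features = df[['f1', 'f2', 'f3', 'f4', 'f5']]

# Univariate scoring of each original column against the target y.
selector = SelectKBest(score_func=f_classif, k=3).fit(features, y)
print(selector.scores_)                          # one score per original column
print(features.columns[selector.get_support()])  # names of the k best columns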