
I have an imbalanced dataset, and I applied RandomOverSampler to get a balanced dataset:

from imblearn.over_sampling import RandomOverSampler

# Oversample the minority class until it matches the majority class
oversample = RandomOverSampler(sampling_strategy='minority')
X_over, y_over = oversample.fit_resample(X, y)
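A quick sanity check (a minimal sketch; Counter is from the standard library, and y is assumed to be a 1-D label array) is to compare the class distribution before and after resampling:

from collections import Counter

# After resampling, both classes should have the same count
print("before:", Counter(y))
print("after: ", Counter(y_over))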

Afterwards, I followed the RandomForest implementation for feature selection from this Kaggle post:

https://www.kaggle.com/gunesevitan/titanic-advanced-feature-engineering-tutorial (go to the bottom of the page and you will see a similar implementation).

I have a real dataset similar to the Titanic one :) and I'm trying to get feature importances out of it!

The problem I'm having is that even though the classifier accuracy is very high (~0.99), the feature importances I'm getting are only on the order of ~0.1. What would be causing that, or is it OK?

[screenshot: feature importance plot]

Here is the code I'm using, similar to the one at the bottom of the page linked above.

SEED = 42
N = 15

classifiers = [RandomForestClassifier(random_state=SEED,
                                      criterion='gini',
                                      n_estimators=20,
                                      bootstrap=True,
                                      max_depth=5,
                                      n_jobs=-1)]

# Other classifiers I tried:
#               DecisionTreeClassifier(),
#               LogisticRegression(),
#               KNeighborsClassifier(),
#               GradientBoostingClassifier(),
#               SVC(probability=True), GaussianNB()]

log_cols = ["Classifier", "Accuracy"]
log      = pd.DataFrame(columns=log_cols)

skf = StratifiedKFold(n_splits=N, shuffle=True, random_state=SEED)

# One column of importances per fold, indexed by feature name
importances = pd.DataFrame(np.zeros((X.shape[1], N)),
                           columns=['Fold_{}'.format(i) for i in range(1, N + 1)],
                           index=data.columns)


acc_dict = {}

# Fold numbering starts at 1 to match the Fold_1..Fold_N column names
for fold, (train_index, test_index) in enumerate(skf.split(X_over, y_over), 1):
    X_train, X_test = X_over[train_index], X_over[test_index]
    y_train, y_test = y_over[train_index], y_over[test_index]

    for clf in classifiers:
        name = clf.__class__.__name__
        clf.fit(X_train, y_train)
        test_predictions = clf.predict(X_test)
        acc = accuracy_score(y_test, test_predictions)

        # Store this fold's feature importances
        if 'Random' in name:
            importances.iloc[:, fold - 1] = clf.feature_importances_

        if name in acc_dict:
            acc_dict[name] += acc
        else:
            acc_dict[name] = acc

        # Optional grid search for the best RF parameters:
        # CV_rfc = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5)
        # CV_rfc.fit(X_train, y_train)

for name in acc_dict:
    acc_dict[name] = acc_dict[name] / N  # average accuracy over all N folds
    log_entry = pd.DataFrame([[name, acc_dict[name]]], columns=log_cols)
    log = pd.concat([log, log_entry], ignore_index=True)  # DataFrame.append is deprecated
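To turn the per-fold columns into a single ranking (a small sketch over the importances frame built above; it only contains the Fold_i columns, so a row-wise mean is safe):

# Average each feature's importance across the N folds and rank them
mean_importance = importances.mean(axis=1).sort_values(ascending=False)
print(mean_importance.head(10))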

I'm getting almost the same feature importance value for every feature; the best is only ~0.1:

[screenshot: feature importance values]

Following the confusion matrix check suggested by @AlexSerraMarrugat:

EDIT

[screenshot: confusion matrix] Test: 0.9926166568222091 Train: 0.9999704661911724

EDIT2

I then tried random oversampling after the train/test split:

from collections import Counter
from imblearn.over_sampling import RandomOverSampler

oversample = RandomOverSampler(sampling_strategy='minority')
x_over, y_over = oversample.fit_resample(X_train, Y_train)

# summarize the class distribution after resampling
print(Counter(y_over))
print(len(x_over))

# Creating the confusion matrix
import matplotlib.pyplot as plt
from sklearn.metrics import plot_confusion_matrix

clf = RandomForestClassifier(random_state=0)  # change the hyperparameters here
clf.fit(x_over, y_over)  # train on the oversampled training data only
predict_y = clf.predict(x_test)
plot_confusion_matrix(clf, x_test, y_test, cmap=plt.cm.Blues)
print("Test: ", clf.score(x_test, y_test))    # evaluated on the untouched test set
print("Train: ", clf.score(x_over, y_over))

Test: 0.9926757235676315 Train: 1.0
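Note: plot_confusion_matrix was removed in scikit-learn 1.2. On newer versions, the equivalent call (a minimal sketch assuming the same clf, x_test, and y_test) is:

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Replacement API for plot_confusion_matrix (scikit-learn >= 1.0)
ConfusionMatrixDisplay.from_estimator(clf, x_test, y_test, cmap=plt.cm.Blues)
plt.show()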

[screenshot: confusion matrix on the test data]

EDIT3 Confusion matrix for the training data:

import matplotlib.pyplot as plt
from sklearn.metrics import plot_confusion_matrix

plot_confusion_matrix(clf, X_train, Y_train, cmap=plt.cm.Blues)
print("Train: ", clf.score(X_train, Y_train))

[screenshot: confusion matrix on the training data]

Alexander
  • I can assure you, it is not correct to oversample before splitting into train and validation sets. You should first split and then oversample only your train data. That is done to simulate real-world usage of your algorithm: you would not oversample the data you want to predict in real life. This explains the suspiciously high accuracy (see the sketch after these comments). – Gaussian Prior Jan 07 '21 at 19:07
  • @GaussianPrior Thanks for the clarification. If I split first with `from sklearn.model_selection import train_test_split; X_train, x_test, Y_train, y_test = train_test_split(X, y, test_size=0.2)`, then oversample with `from imblearn.over_sampling import RandomOverSampler; oversample = RandomOverSampler(sampling_strategy='minority'); x_over, y_over = oversample.fit_resample(X_train, Y_train)`, and then do `clf.fit(x_over, y_over)`, the accuracy decreases from 99% to 0.1%. – Alexander Jan 08 '21 at 06:30
  • Wait what? From 99% to 10% or from 99% to 0.1%? How many classes do you have? – Gaussian Prior Jan 08 '21 at 11:22
  • @GaussianPrior When I split into train and test datasets (test_size=0.2), I got 16k 0's and 300 1's in the test dataset. – Alexander Jan 08 '21 at 17:42
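A sketch of the leakage-free pattern Gaussian Prior describes (assuming X, y, and SEED from the question; imblearn's Pipeline applies the sampler only during fit, so each fold's validation data is never oversampled):

from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# The sampler runs inside fit() on each training split only,
# so validation folds keep their original class distribution
pipe = Pipeline(steps=[
    ('over', RandomOverSampler(sampling_strategy='minority', random_state=SEED)),
    ('rf', RandomForestClassifier(random_state=SEED, n_estimators=20, max_depth=5)),
])

skf = StratifiedKFold(n_splits=15, shuffle=True, random_state=SEED)
scores = cross_val_score(pipe, X, y, cv=skf, scoring='accuracy')
print(scores.mean())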

1 Answer


First of all, as Gaussian Prior said, you have to oversample only your train dataset. Then, once the model is trained, test the accuracy on your untouched test dataset.

If I have understood you correctly, you now have 0.1% accuracy on your test data. Please check whether you are overfitting (if the accuracy on the train dataset is much higher than the accuracy on the test data, it probably indicates overfitting). Try changing some hyperparameters. Use this code:

clf = RandomForestClassifier(random_state=0)  # change the hyperparameters here
clf.fit(X_train, y_train)
predict_y = clf.predict(X_test)
plot_confusion_matrix(clf, X_test, y_test, cmap=plt.cm.Blues)
print("Test: ", clf.score(X_test, y_test))
print("Train: ", clf.score(X_train, y_train))

About feature importance: I suspect that your results are correct. They are saying that you have five features that are the most important for your model. In my opinion, this is one of the best outcomes, where you have a few clearly important features.

You would only obtain a single big value if there were one uniquely important feature (the model obtaining information from only one feature, which is not good at all).
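This is also why the individual values look small: scikit-learn normalizes feature_importances_ so they sum to 1, so with roughly ten informative features each one gets about 0.1. A quick check (assuming the fitted clf from above):

# Importances always sum to (approximately) 1, so ~0.1 per feature
# simply means the signal is spread across ~10 features
print(clf.feature_importances_.sum())                      # ~1.0
print(sorted(clf.feature_importances_, reverse=True)[:5])  # the top five shares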

Alex Serra Marrugat
  • Thanks. As you can see in my comment about oversampling: `from imblearn.over_sampling import RandomOverSampler; oversample = RandomOverSampler(sampling_strategy='minority'); x_over, y_over = oversample.fit_resample(X_train, Y_train)`. Am I missing something here? Do you have a good example of using oversampling and RandomForest together? – Alexander Jan 08 '21 at 17:22
  • It's correct. Can you tell me the accuracy on the training data (which must be a balanced dataset) and the accuracy on the test data (which must be imbalanced, and which is 0.1% as you said)? – Alex Serra Marrugat Jan 08 '21 at 19:21
  • Hi, I edited the OP; please check EDIT2. So the classifier only predicted one '1' case correctly (bottom right) after oversampling? – Alexander Jan 08 '21 at 19:35
  • Plot also the confusion matrix for the training data. It seems that your model is overfitted, because you get 100% on it. I recommend adjusting hyperparameters like max_depth. I recommend this link: https://stackoverflow.com/questions/20463281/how-do-i-solve-overfitting-in-random-forest-of-python-sklearn – Alex Serra Marrugat Jan 08 '21 at 21:01
  • Thanks. I added confusion matrix of `training data`. The parameters I used also in the OP (top of the code). Nothing changed over there. – Alexander Jan 08 '21 at 21:19
  • Thank you. For sure it's a problem of over-fitting. Your model is only learning to identify the 1 of your training data, that's why you have such a bad result in your test accuracy. You have to tune your hyperparameters. Try setting max_depth=1 lower and decreasing number of features (0.3 or similar). – Alex Serra Marrugat Jan 09 '21 at 11:37