
I have to use a decision tree for binary classification on an unbalanced dataset (50,000 instances of class 0, 1,000 of class 1). To get a good recall (0.92) I used the RandomOverSampler from the imblearn module and pruned the tree with the max_depth parameter. The problem is that the precision is very low (0.44): I have too many false positives.

I tried to train a specific classifier to deal with the borderline instances that generate false positives. First I split the dataset into train and test sets (80%/20%). Then I split the train set into train2 and test2 (66%/33%). I used a dtc (#1) to predict test2 and kept only the instances predicted as positive. Then I trained a dtc (#2) on those instances, with the goal of building a classifier able to distinguish borderline cases. I used the dtc (#3) trained on the oversampled train set from the first split to predict the official test set and got Recall = 0.92 and Precision = 0.44. Finally I applied dtc (#2) only to the instances predicted as positive by dtc (#3), hoping to separate TP from FP, but it doesn't work very well: I got Rec = 0.79 and Prec = 0.69.

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

# 80/20 train/test split, then oversample the minority class in the training set only
x_train, X_test, y_train, Y_test = train_test_split(df2.drop('k', axis=1), df2['k'], test_size=0.2, random_state=0)
ros = RandomOverSampler(random_state=0)
x_res, y_res = ros.fit_resample(x_train, y_train)

df_to_trick = df2.iloc[x_train.index.tolist(), :]
#....split in 0.33-0.66, trained and tested
confusion_matrix(y_test, predicted1)  # dtc1 on test2
array([[13282,   266],
       [   18,   289]])

# training dtc2 only on these (266+289) instances

confusion_matrix(Y_test, predicted3)  # dtc3 on the official test set
array([[9950,  294],
       [  20,  232]])

confusion_matrix(true, predicted4)  # dtc2 applied to the (294+232) instances predicted positive by dtc3
array([[204,  90],
       [ 34, 198]])
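Roughly, the whole cascade looks like this (just a sketch; the tree hyperparameters and intermediate variable names here are placeholders):

from sklearn.tree import DecisionTreeClassifier

# split the (non-oversampled) train set again: ~66% train2, ~33% test2
x_tr2, x_te2, y_tr2, y_te2 = train_test_split(x_train, y_train, test_size=0.33, random_state=0)

# dtc1 predicts test2; keep only the instances it labels as positive
dtc1 = DecisionTreeClassifier(max_depth=5, random_state=0).fit(x_tr2, y_tr2)
pos = dtc1.predict(x_te2) == 1

# dtc2 is trained only on those predicted-positive (borderline) instances
dtc2 = DecisionTreeClassifier(max_depth=5, random_state=0).fit(x_te2[pos], y_te2[pos])

# dtc3 is trained on the oversampled train set and predicts the official test set
dtc3 = DecisionTreeClassifier(max_depth=5, random_state=0).fit(x_res, y_res)
pred3 = dtc3.predict(X_test)

# dtc2 re-classifies only the instances that dtc3 predicted as positive
final = pred3.copy()
final[pred3 == 1] = dtc2.predict(X_test[pred3 == 1])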

I have to choose between dtc3 (Recall = 0.92, Precision = 0.44) and the whole convoluted process above (Recall = 0.79, Precision = 0.69). Do you have any ideas to improve these metrics? My goal is roughly 0.8/0.9 for both.

AlanT
  • What about using something like GridSearchCV with roc_auc scoring? https://stackoverflow.com/questions/49061575/why-when-i-use-gridsearchcv-with-roc-auc-scoring-the-score-is-different-for-gri and https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc – bart cubrich Mar 27 '19 at 19:00
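A minimal sketch of that suggestion (the parameter grid values are just placeholders):

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# tune the tree with ROC AUC as the selection metric
param_grid = {'max_depth': [3, 5, 8, 12], 'min_samples_leaf': [1, 5, 20]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, scoring='roc_auc', cv=5)
search.fit(x_res, y_res)
print(search.best_params_, search.best_score_)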

1 Answer


Keep in mind that precision and recall depend on the threshold that you choose (in sklearn the default threshold is 0.5: any instance with a predicted probability > 0.5 is classified as positive), and that there will always be a trade-off between precision and recall. ...
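For example, a minimal sketch of applying a custom threshold (assuming your fitted tree is dtc3 and X_test/Y_test are the splits from your question; the 0.7 value is just an example):

from sklearn.metrics import precision_score, recall_score

# probability of the positive class for each test instance
probs = dtc3.predict_proba(X_test)[:, 1]

# classify as positive only above a custom threshold instead of the default 0.5
predicted = (probs >= 0.7).astype(int)
print(precision_score(Y_test, predicted), recall_score(Y_test, predicted))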

I think in the case you are describing (trying to fine-tune your classifier given your model's performance limitations) you can choose a higher or lower cut-off threshold that gives a more favorable precision-recall trade-off ...

The code below can help you visualize how your precision and recall change as you move the decision threshold:

import matplotlib.pyplot as plt

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    # precisions and recalls have one more element than thresholds, hence the [:-1]
    plt.figure(figsize=(8, 8))
    plt.title("Precision and Recall Scores as a function of the decision threshold")
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.ylabel("Score")
    plt.xlabel("Decision Threshold")
    plt.legend(loc='best')
    plt.show()
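
To get the precision/recall/threshold arrays you can use sklearn's precision_recall_curve (again assuming dtc3, X_test and Y_test from your question):

from sklearn.metrics import precision_recall_curve

probs = dtc3.predict_proba(X_test)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(Y_test, probs)
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)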

Other suggestions to improve your model's performance are to use an alternative pre-processing method (e.g. SMOTE instead of random oversampling) or to choose a more complex classifier (a random forest / ensemble of trees, or a boosting approach such as AdaBoost or gradient boosting).
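For instance, a minimal sketch of swapping in SMOTE and a random forest (the hyperparameters are placeholders, not tuned values):

from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

# SMOTE synthesizes new minority-class samples instead of duplicating existing ones
x_res, y_res = SMOTE(random_state=0).fit_resample(x_train, y_train)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(x_res, y_res)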

marwan