I have to use a decision tree for binary classification on an unbalanced dataset (50,000 instances of class 0 vs 1,000 of class 1). To get a good recall (0.92) I used `RandomOverSampler` from the imbalanced-learn (imblearn) package and pruned the tree with the `max_depth` parameter. The problem is that precision is very low (0.44): I have too many false positives.
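To make the setup concrete, here is a minimal sketch of that first step on synthetic data. The dataset, the `max_depth=5` value, and all variable names are my own placeholders; I also reimplement the random oversampling with `sklearn.utils.resample` (which duplicates minority rows with replacement, essentially what imblearn's `RandomOverSampler` does) so the sketch only needs scikit-learn:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample
from sklearn.metrics import recall_score, precision_score

# Synthetic stand-in for the imbalanced dataset (~2% positives).
X, y = make_classification(n_samples=5000, weights=[0.98], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Random oversampling: duplicate minority rows (with replacement)
# until both classes have the same number of training samples.
minority = y_train == 1
X_min_up, y_min_up = resample(X_train[minority], y_train[minority],
                              n_samples=(~minority).sum(), random_state=0)
X_res = np.vstack([X_train[~minority], X_min_up])
y_res = np.concatenate([y_train[~minority], y_min_up])

# Pruned tree: max_depth caps tree growth so it cannot memorize
# the duplicated minority samples.
clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_res, y_res)
pred = clf.predict(X_test)
print(recall_score(y_test, pred), precision_score(y_test, pred))
```

Oversampling only the training split (after `train_test_split`) matters: resampling before the split leaks duplicated minority rows into the test set and inflates both metrics.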
I tried to train a dedicated classifier to deal with the borderline instances that generate false positives. First I split the dataset into train and test sets (80%/20%). Then I split train into train2 and test2 sets (66%/33%). I used dtc#1 (trained on train2) to predict test2 and kept only the instances predicted as positive. Then I trained dtc#2 on that data, with the goal of building a classifier able to distinguish borderline cases. I used dtc#3, trained on the first oversampled train set, to predict the official test set and got recall = 0.92 and precision = 0.44. Finally I applied dtc#2 only to the instances that dtc#3 predicted as positive, hoping it would separate TP from FP, but it doesn't work very well: I got recall = 0.79 and precision = 0.69.
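For reference, the cascade described above can be sketched like this on synthetic data. Everything here is a hypothetical reconstruction (variable names, `max_depth=5`, a ~10:1 synthetic imbalance), and the oversampling step for dtc3 is omitted to keep the sketch short:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, weights=[0.9], random_state=0)

# 80/20 split, then the train side again into train2/test2 (66/33).
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_tr, y_tr, test_size=0.33, stratify=y_tr, random_state=0)

# dtc1: trained on train2, used only to flag test2 rows it predicts positive.
dtc1 = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr2, y_tr2)
flagged = dtc1.predict(X_te2) == 1

# dtc2: trained only on those flagged (borderline) rows, learning TP vs FP.
dtc2 = DecisionTreeClassifier(max_depth=5, random_state=0).fit(
    X_te2[flagged], y_te2[flagged])

# dtc3: trained on the full train split, scores the official test set.
dtc3 = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
stage1 = dtc3.predict(X_te) == 1

# Final label: positive only where dtc3 and dtc2 both agree.
final = np.zeros_like(y_te)
if stage1.any():
    final[stage1] = dtc2.predict(X_te[stage1])
```

By construction dtc2 can only remove positives from dtc3's output, which is why precision went up while recall went down in my results.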
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

# 80/20 split: test_size takes the fraction, random_state an integer seed
x_train, X_test, y_train, Y_test = train_test_split(
    df2.drop('k', axis=1), df2['k'], test_size=0.2, random_state=0)

ros = RandomOverSampler(random_state=0)
x_res, y_res = ros.fit_resample(x_train, y_train)

# recover the original training rows; .loc selects by index label,
# which stays correct even if df2's index is not a plain 0..n range
df_to_trick = df2.loc[x_train.index, :]

# ....split in 0.33-0.66, trained and tested
confusion_matrix(y_test, predicted1)  # dtc1 on test2
array([[13282,   266],
       [   18,   289]])
# training dtc2 only on those (266+289) instances dtc1 predicted positive
confusion_matrix(Y_test, predicted3)  # dtc3 on the official test set
array([[9950,  294],
       [  20,  232]])
confusion_matrix(true, predicted4)  # dtc2 applied to the (294+232) instances dtc3 predicted positive
array([[204,  90],
       [ 34, 198]])
I have to choose between dtc3 alone (recall = 0.92, precision = 0.44) and the whole convoluted pipeline (recall = 0.79, precision = 0.69). Do you have any ideas to improve these metrics? My target is about 0.8/0.9 for both.