I am trying to do classification on an imbalanced dataset (2000 data points from the positive class and 98880 data points from the negative class). I use Precision, Recall, F-Score and AUC to report the models' performance, but the way these models behave surprised me. The models' results are as follows:
TP:1982, TN:87920, FP:10960, FN:18 | PR:0.153, RE:0.991, F1:0.265, AUC:0.972
TP:22, TN:98877, FP:3, FN:1978 | PR:0.880, RE:0.011, F1:0.022, AUC:0.810
TP:148, TN:98271, FP:609, FN:1852 | PR:0.196, RE:0.074, F1:0.107, AUC:0.700
TP:1611, TN:98847, FP:33, FN:389 | PR:0.980, RE:0.805, F1:0.884, AUC:0.998
As you can see,
- In the first model, the precision is very low and the recall is very high, which leads to a low F-Score and a high AUC.
- In the second model, the precision is high and the recall is low, but the result is similar: a high AUC and a low F-Score.
- In the third model, both precision and recall are very low, which results in a low F-Score, but surprisingly the AUC is still fairly high.
- In the fourth model, the precision and recall are both high, and therefore the F-Score and AUC are high.
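For reference, here is a minimal sketch of how the Precision, Recall and F-Score values above can be recomputed from the confusion-matrix counts. The function name `precision_recall_f1` and the `models` dictionary are just names I chose for illustration; the counts are copied from the four result rows.

```python
# Confusion-matrix counts copied from the four result rows above.
models = {
    "model 1": {"tp": 1982, "tn": 87920, "fp": 10960, "fn": 18},
    "model 2": {"tp": 22, "tn": 98877, "fp": 3, "fn": 1978},
    "model 3": {"tp": 148, "tn": 98271, "fp": 609, "fn": 1852},
    "model 4": {"tp": 1611, "tn": 98847, "fp": 33, "fn": 389},
}


def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for the positive class.

    True negatives are not needed: none of these three metrics uses them.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


for name, counts in models.items():
    prec, rec, f1 = precision_recall_f1(counts["tp"], counts["fp"], counts["fn"])
    print(f"{name}: PR={prec:.3f} RE={rec:.3f} F1={f1:.3f}")
```

AUC is not recomputed here because it depends on the ranking of the predicted scores across all thresholds, not on a single confusion matrix.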
So, can I conclude that, for my problem, F-Score is a better performance metric than AUC?