High mAP@50 with low precision and recall. What does it mean and what metric should be more important?

Question

I am comparing models for the detection of objects for maritime Search and Rescue (SAR) purposes. From the models that I used, I got the best results for the improved version of YOLOv3 for small object detection and for FASTER RCNN.

For YOLOv3 I got the best mAP@50, but FASTER RCNN scored better on all the other metrics (precision, recall, F1 score). Now I am wondering how to read this, and which model is really better in this case?

I would like to add that there are only two classes in the dataset: small and large objects. We chose this setup because distinguishing between classes matters less to us than detecting any object of human origin.

However, small objects don't mean small GT bounding boxes. These are objects that actually have a small area - less than 2 square meters (e.g. people, buoys). Large objects are objects with a larger area (boats, ships, canoes, etc.).

Here are the results per category:

And two sample images from the dataset (with YOLOv3 detections):

score 15 · Accepted Answer · answered Jul 19 '20 at 18:53

The mAP for object detection is the mean of the AP values calculated for each class. mAP@0.5 is the mAP calculated at an IoU threshold of 0.5.
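To make the IoU threshold concrete, here is a minimal Python sketch. The (x1, y1, x2, y2) box format, the function name box_iou and the example coordinates are my own assumptions for illustration, not something taken from the question or from any particular framework.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Width and height of the overlap rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical ground-truth box and a prediction shifted by one pixel in x and y.
ground_truth = (0, 0, 10, 10)
prediction = (1, 1, 11, 11)

iou = box_iou(ground_truth, prediction)  # 81 / 119, roughly 0.68
counts_at_050 = iou >= 0.50              # overlaps enough to count for mAP@0.5
counts_at_075 = iou >= 0.75              # would not count at a stricter IoU threshold
```

In a full evaluation a prediction is usually also required to have the right class and to be matched to a ground-truth box that no higher-confidence prediction has already claimed; that matching step is left out here.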

The general definition of Average Precision (AP) is the area under the precision-recall curve.

The precision-recall curve is obtained by plotting the model's precision against its recall as the model's confidence threshold is varied.
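As a rough sketch of that definition, the Python below computes AP for one class at one IoU threshold. It assumes each detection has already been labelled as a true or false positive, and it uses all-point interpolation of the curve; evaluation toolkits differ in the exact interpolation, so treat this as one possible variant. The function name and the example detections are made up for illustration.

```python
def average_precision(detections, num_ground_truth):
    """AP for one class at one IoU threshold.

    detections: list of (confidence, is_true_positive) pairs.
    num_ground_truth: number of ground-truth objects of that class.
    """
    if num_ground_truth == 0 or not detections:
        return 0.0
    # Lowering the confidence threshold admits detections in this order.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    precisions, recalls = [], []
    tp = fp = 0
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    # Interpolate: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the interpolated curve, summed as rectangles.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Made-up detections for a class with 4 ground-truth objects.
example = [(0.9, True), (0.8, False), (0.7, True), (0.6, True), (0.3, False)]
ap = average_precision(example, num_ground_truth=4)  # 0.625
```

mAP is then the mean of this value over the classes (two in your dataset).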

Precision measures how accurate your predictions are, i.e. the percentage of your predictions that are correct. Recall measures how well you find all the positives, i.e. the percentage of ground-truth objects that are detected. The F1 score is the harmonic mean (HM) of precision and recall.

To answer your questions now.

How to read it and which model is really better in this case?

mAP is a good measure of how the network behaves over the whole range of confidence thresholds rather than at a single operating point. So a good mAP indicates a model that's stable and consistent across different confidence thresholds. In your case, the Faster R-CNN results indicate that its precision-recall curve is worse than YOLOv3's, which means that Faster R-CNN has either very bad recall at higher confidence thresholds or very bad precision at lower confidence thresholds compared to YOLOv3 (especially for small objects).
Precision, recall and F1 score are computed for a given confidence threshold. I'm assuming you're running the models with the default confidence threshold (could be 0.25). So the higher precision, recall and F1 score of Faster R-CNN indicate that, at that confidence threshold, it's better than YOLOv3 on all three metrics.
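A toy example may help show how both observations can hold at once. In the Python sketch below, model_a and model_b are hypothetical detection lists with invented confidences (they are not your YOLOv3 or Faster R-CNN outputs), and AP uses all-point interpolation as an assumption. Model A keeps finding objects at low confidence, so its AP is higher, while model B is better at the single threshold of 0.25.

```python
def average_precision(detections, num_gt):
    """All-point interpolated AP from (confidence, is_true_positive) pairs."""
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = 0
    precisions, recalls = [], []
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        tp += is_tp
        precisions.append(tp / rank)
        recalls.append(tp / num_gt)
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def precision_recall_f1(detections, num_gt, conf_threshold):
    """Metrics at a single operating point: keep detections at or above the threshold."""
    kept = [is_tp for conf, is_tp in detections if conf >= conf_threshold]
    tp = sum(kept)
    precision = tp / len(kept) if kept else 0.0
    recall = tp / num_gt
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


NUM_GT = 4  # hypothetical number of ground-truth objects

# Invented data: model A finds every object, but mostly with low confidence.
model_a = [(0.90, True), (0.50, False), (0.20, True),
           (0.18, True), (0.15, True), (0.05, False)]
# Invented data: model B is confident, but misses one object entirely.
model_b = [(0.95, True), (0.90, True), (0.85, False), (0.80, True)]

ap_a = average_precision(model_a, NUM_GT)            # 0.85
ap_b = average_precision(model_b, NUM_GT)            # 0.6875
prf_a = precision_recall_f1(model_a, NUM_GT, 0.25)   # (0.5, 0.25, about 0.33)
prf_b = precision_recall_f1(model_b, NUM_GT, 0.25)   # (0.75, 0.75, 0.75)
```

If something like this is happening with your models, lowering the confidence threshold for YOLOv3 should raise its recall noticeably.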

What metric should be more important?

In general, to work out which model performs better, I would suggest using a validation set (the data set used to tune hyper-parameters) and a test set (the data set used to assess the performance of a fully trained model).

Note: FP = false positive, FN = false negative.

On validation set:

Use mAP to select the best-performing model (the one that is more stable and consistent) out of all the trained weights across iterations/epochs. Also use mAP to decide whether the model should be trained/tuned further.
Check the class-level AP values to make sure the model is stable and good across the classes.
Depending on the use case/application, if you're completely tolerant of FNs and highly intolerant of FPs, use precision to train/tune the model.
Depending on the use case/application, if you're completely tolerant of FPs and highly intolerant of FNs, use recall to train/tune the model.

On test set:

If you're neutral towards FPs and FNs, use the F1 score to pick the best-performing model.
If FPs are not acceptable to you (and you don't care much about FNs), pick the model with the higher precision.
If FNs are not acceptable to you (and you don't care much about FPs), pick the model with the higher recall.
Once you have decided which metric to use, try out multiple confidence thresholds (for example 0.25, 0.35 and 0.5) for a given model to see at which threshold the chosen metric works in your favour, and to understand the acceptable trade-off range (say you want a precision of at least 80% and some decent recall). Once the confidence threshold is decided, use it across the different models to find the best-performing one.
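The threshold search in the last point could look roughly like the Python sketch below. The detections are invented, the helper names are hypothetical, and the selection rule (highest recall among thresholds that reach the precision target) is just one reasonable reading of "at least 80% precision and some decent recall".

```python
def precision_recall(detections, num_gt, conf_threshold):
    """Precision and recall when only detections at or above the threshold are kept."""
    kept = [is_tp for conf, is_tp in detections if conf >= conf_threshold]
    tp = sum(kept)
    precision = tp / len(kept) if kept else 0.0
    recall = tp / num_gt if num_gt else 0.0
    return precision, recall


def pick_threshold(detections, num_gt, candidates, min_precision):
    """Return (threshold, precision, recall) with the highest recall among the
    candidates that reach min_precision, or None if no candidate does."""
    best = None
    for threshold in candidates:
        precision, recall = precision_recall(detections, num_gt, threshold)
        if precision >= min_precision and (best is None or recall > best[2]):
            best = (threshold, precision, recall)
    return best


# Invented detections as (confidence, is_true_positive), with 6 ground-truth objects.
detections = [
    (0.95, True), (0.90, True), (0.80, True), (0.60, True),
    (0.45, False), (0.40, True), (0.30, False), (0.28, False),
]
NUM_GT = 6

choice = pick_threshold(detections, NUM_GT, [0.25, 0.35, 0.5], min_precision=0.8)
# 0.25 gives precision 5/8 (too low); 0.35 gives 5/6 and recall 5/6;
# 0.5 gives precision 1.0 but recall only 4/6, so 0.35 is selected.
```

If FNs are the bigger concern, the roles can be swapped: require a minimum recall and take the candidate with the highest precision.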

You said - "So good mAP indicates a model that's stable and consistent across different confidence thresholds." However, when we calculate mAP, we vary the IoU threshold, right? For example, when we say mAP@0.5:0.05:0.95, we mean the mAP calculated at IoU thresholds of 0.5, 0.55, 0.6, ..., 0.95. So shouldn't the statement be "good mAP indicates a model that's stable and consistent across different IoU thresholds"? — ARPIT PRASHANT BAHETY, Apr 16 '21 at 05:20
mAP is calculated across different CONFIDENCE thresholds, as Venkatesh said. When we calculate mAP@0.5:0.05:0.95 we ALSO do it across different IoU thresholds. But averaging across IoU thresholds is only an addition, whereas averaging across confidence thresholds is part of the definition of mAP. — Panicum, May 02 '21 at 18:12
