I am new to array programming and find it difficult to interpret sklearn.metrics' label_ranking_average_precision_score function. I need help understanding how it is calculated, and I would appreciate any tips for learning NumPy array programming.
Generally, I know precision is
True Positives / (True Positives + False Positives)
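For example, here is my own quick sanity check of plain binary precision with sklearn (just an illustration, not the ranking metric I am asking about):

from sklearn.metrics import precision_score

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]
# Two of the three predicted positives are correct: 2 / (2 + 1) = 0.666...
print(precision_score(y_true, y_pred))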
The reason I am asking is that I stumbled upon a Kaggle competition for audio tagging and came across this post, which says they use an LWRAP function to calculate the score when there is more than one correct label in a response. I started reading up on how this score is calculated and found it difficult to interpret. My two difficulties are:
1) Interpreting the math formula in the documentation: I am not sure how the ranks are used in the score calculation.
2) Interpreting the NumPy array operations in the code.
The function I am reading is from a Google Colab notebook; I also tried reading the sklearn documentation but could not understand it properly.
The code for the one-sample calculation is:
import numpy as np

# Core calculation of label precisions for one test sample.
def _one_sample_positive_class_precisions(scores, truth):
  """Calculate precisions for each true class for a single sample.

  Args:
    scores: np.array of (num_classes,) giving the individual classifier scores.
    truth: np.array of (num_classes,) bools indicating which classes are true.

  Returns:
    pos_class_indices: np.array of indices of the true classes for this sample.
    pos_class_precisions: np.array of precisions corresponding to each of those
      classes.
  """
  num_classes = scores.shape[0]
  pos_class_indices = np.flatnonzero(truth > 0)
  # Only calculate precisions if there are some true classes.
  if not len(pos_class_indices):
    return pos_class_indices, np.zeros(0)
  # Retrieval list of classes for this sample, best-scoring first.
  retrieved_classes = np.argsort(scores)[::-1]
  # class_rankings[top_scoring_class_index] == 0 etc.
  class_rankings = np.zeros(num_classes, dtype=int)
  class_rankings[retrieved_classes] = range(num_classes)
  # Which of these is a true label?
  retrieved_class_true = np.zeros(num_classes, dtype=bool)
  retrieved_class_true[class_rankings[pos_class_indices]] = True
  # Num hits for every truncated retrieval list.
  retrieved_cumulative_hits = np.cumsum(retrieved_class_true)
  # Precision of retrieval list truncated at each hit, in order of pos_labels.
  precision_at_hits = (
      retrieved_cumulative_hits[class_rankings[pos_class_indices]] /
      (1 + class_rankings[pos_class_indices].astype(float)))
  return pos_class_indices, precision_at_hits
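To make my question concrete, here is a tiny example I traced through by hand (the scores and labels are made up, the function above is assumed to be defined, and the trace reflects my own understanding, so it may be off):

scores = np.array([0.9, 0.1, 0.8, 0.6])       # classifier scores for 4 classes
truth = np.array([True, False, False, True])  # classes 0 and 3 are the true labels

pos_class_indices, precision_at_hits = _one_sample_positive_class_precisions(scores, truth)
print(pos_class_indices)   # [0 3]
print(precision_at_hits)   # [1.         0.66666667]
# Class 0 is ranked 1st, so its precision is 1/1; class 3 is ranked 3rd and the
# top-3 retrieval list contains 2 true labels, so its precision is 2/3.

from sklearn.metrics import label_ranking_average_precision_score
# Averaging the per-class precisions seems to reproduce sklearn's per-sample score.
print(precision_at_hits.mean())  # 0.8333...
print(label_ranking_average_precision_score(truth.reshape(1, -1).astype(int),
                                            scores.reshape(1, -1)))  # 0.8333...

Is this interpretation of how the ranks enter the calculation correct, and how do the NumPy indexing steps (argsort, class_rankings, cumsum) achieve it?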