
I've written a model that predicts on ordinal data. At the moment, I'm evaluating the model using quadratic weighted Cohen's kappa. I'm looking for a way to visualize the results with a confusion matrix and then calculate recall, precision, and F1 score while taking the prediction distance into account.

I.e., predicting 2 when the actual class was 1 is better than predicting 3 when the actual class was 1.

I've written the following code to plot and calculate the results:

import seaborn as sns
from sklearn.metrics import confusion_matrix, recall_score, precision_score, f1_score

def plot_cm(df, ax):
    # Row-normalized confusion matrix over the 9 ordinal classes (0-8)
    cf_matrix = confusion_matrix(df.x, df.y, normalize='true', labels=[0, 1, 2, 3, 4, 5, 6, 7, 8])

    ax = sns.heatmap(cf_matrix, linewidths=1, annot=True, ax=ax, fmt='.2f')
    ax.set_ylabel('Actual')
    ax.set_xlabel('Predicted')

    print('Recall score:', recall_score(df.x, df.y, average='weighted', zero_division=0))
    print('Precision score:', precision_score(df.x, df.y, average='weighted', zero_division=0))
    print('F1 score:', f1_score(df.x, df.y, average='weighted', zero_division=0))

[Confusion matrix heatmap produced by plot_cm]

Recall score: 0.53505
Precision score: 0.5454783454981732
F1 score: 0.5360650278722704

The visualization is fine; however, the calculation ignores predictions that were "almost" correct, i.e. predicting 8 when the actual class was 9 (for example).

Is there a way to calculate Recall, Precision and F1 taking into account the ordinal behavior of the data?

  • In short, no. You need to design your own metric. It won't be a precision and recall (by definition), but it will have the same properties (and that's what I guess you are after). I'd consider the L1 norm, i.e. |`true value` - `predicted value`|. – Lukasz Tracewski Aug 02 '21 at 11:08
  • Thank you for your reply, can you elaborate a bit about L1 norm? – Shlomi Schwartz Aug 02 '21 at 11:33
  • That's essentially what I wrote: absolute value of difference between the "true" and "predicted". When they match, you have 0 - perfect match. The further they are apart, the greater this number becomes, i.e. the bigger mistake / penalty. The best case scenario is 0 across all classes, the worst you can compute too. That means you can normalise your score, have it between 0 and 1. That's where you can apply precision / recall formalism to get relevant metrics. – Lukasz Tracewski Aug 02 '21 at 12:06
  • I could try to propose something along these lines, but mind it won't be a "precision" / "recall". It won't have the same properties. It will be informative though and allow you to compare performance between classes. – Lukasz Tracewski Aug 03 '21 at 07:09
  • I would love to see your idea. BTW, have a look here: https://towardsdatascience.com/confusion-matrix-for-your-multi-class-machine-learning-model-ff9aa3bf7826 – Shlomi Schwartz Aug 03 '21 at 08:26
  • What about this link? Again, whatever I am going to propose, can only have some properties of precision / recall. I know how to compute those :). – Lukasz Tracewski Aug 03 '21 at 11:24
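A minimal sketch of the L1-norm idea suggested in the comments, assuming 9 classes (0-8) as in the question's code; the function name and the scaling by the worst-case distance are my own choices, not part of the discussion:

import numpy as np

def l1_closeness_score(y_true, y_pred, n_classes=9):
    # Per-sample score in [0, 1]: 1 for an exact match, decreasing linearly
    # with the absolute distance between the true and predicted class.
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # The worst possible error is n_classes - 1 (e.g. predicting 8 when the class is 0).
    return 1 - np.abs(y_true - y_pred) / (n_classes - 1)

# e.g. l1_closeness_score(df.x, df.y).mean() gives a single normalized score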

1 Answer


A regular precision (for a class) is calculated as the ratio of true positives to the total number of samples detected as that class. Usually a true positive detection is defined in a binary fashion: you either correctly detected the class or you didn't. There is no restriction whatsoever against making the TP detection score for sample i fuzzy (in other words, lightly penalizing close-to-class detections and making the penalty more severe as the difference grows):

TP(i) = max(0, 1 - abs(detected_class(i) - true_class(i)) / penalty_factor)

where TP(i) is the value of the "true positive detection" for sample i, a number in [0, 1]. It is reasonable to make penalty_factor equal to the number of classes (it should be larger than 1). By changing it you can control how heavily "distant" classes are penalized. For example, if you decide that a difference of more than 3 is enough to consider a sample "not detected", set it to 3. If you set it to 1, you get back the "regular" precision formulation. The max() makes sure the TP score never becomes negative.
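For concreteness, here is a minimal NumPy sketch of that per-sample TP score (the function name is mine, not part of the answer; the default penalty_factor of 9 follows the suggestion to use the number of classes):

import numpy as np

def fuzzy_tp(y_true, y_pred, penalty_factor=9):
    # TP(i) = max(0, 1 - |predicted - true| / penalty_factor), a score in [0, 1]
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return np.maximum(0, 1 - np.abs(y_pred - y_true) / penalty_factor)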

Now, to get the denominator right, you need to set it to the count of samples that got TP(i) > 0. That is, if you have 100 samples in total, and out of those 5 were detected with a TP score of 1 and 6 got a TP score of 0.5, your precision would be (5 + 6*0.5)/(5 + 6).
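A sketch of that precision computation, reusing the hypothetical fuzzy_tp helper above (this is my reading of the denominator rule, not code from the answer):

def fuzzy_precision(y_true, y_pred, cls, penalty_factor=9):
    # Fuzzy precision for one predicted class: sum of TP scores divided by
    # the number of samples that received a non-zero TP score.
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    mask = y_pred == cls                          # samples detected as `cls`
    tp = fuzzy_tp(y_true[mask], y_pred[mask], penalty_factor)
    relevant = np.count_nonzero(tp > 0)           # denominator from the rule above
    return tp.sum() / relevant if relevant else 0.0

# e.g. (5 * 1 + 6 * 0.5) / 11 reproduces the worked example above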

One issue here is that "precision per class" loses its usual meaning, since every prediction becomes somewhat relevant to all classes. If you need a total precision "weighted" by class (for the unbalanced-classes case), you need to factor the weight into the TP score according to the true class of sample i.

Employing the same logic, the Recall would be the sum of TP scores over the relevant population, i.e.

R = (sum of (weighted) TP scores)/(total amount of samples)

And, finally, F1 is the harmonic mean of precision and recall.
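Putting it together, a hedged sketch of the recall and F1 parts, again built on the hypothetical fuzzy_tp helper from above:

def fuzzy_recall(y_true, y_pred, penalty_factor=9):
    # Sum of per-sample TP scores over the total number of samples.
    tp = fuzzy_tp(y_true, y_pred, penalty_factor)
    return tp.sum() / len(tp)

def fuzzy_f1(precision, recall):
    # Harmonic mean of the fuzzy precision and recall.
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0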
