Computing macro f1 score using sklearn

Question

I am using sklearn to compute macro f1 score and I doubt if there are any bugs in the code. Here is an example (label 0 is ignored):

from sklearn.metrics import f1_score, precision_recall_fscore_support

y_true = [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4]
y_pred = [1, 1, 1, 0, 0, 2, 2, 3, 3, 3, 4, 3, 4, 3]



p_macro, r_macro, f_macro, support_macro \
    = precision_recall_fscore_support(y_true=y_true, y_pred=y_pred, labels=[1, 2, 3, 4], average='macro')

p_micro, r_micro, f_micro, support_micro\
    = precision_recall_fscore_support(y_true=y_true, y_pred=y_pred, labels=[1, 2, 3, 4], average='micro')

def f(p, r):
    return 2*p*r/(p+r)

my_f_macro = f(p_macro, r_macro)

my_f_micro = f(p_micro, r_micro)

print('my f macro {}'.format(my_f_macro))

print('my f micro {}'.format(my_f_micro))

print('macro: p {}, r {}, f1 {}'.format(p_macro, r_macro, f_macro))

print('micro: p {}, r {}, f1 {}'.format(p_micro, r_micro, f_micro))

The output:

my f macro 0.6361290322580646
my f micro 0.6153846153846153
macro: p 0.725, r 0.5666666666666667, f1 0.6041666666666666
micro: p 0.6666666666666666, r 0.5714285714285714, f1 0.6153846153846153

As you can see, sklearn gives 0.6041666666666666 for macro f1. However, it does not equal to 2*0.725*0.566666666/(0.725+0.566666666), where 0.725 and 0.566666666 are macro precision and macro recall computed by sklearn.

score 2 · Accepted Answer · edited Jun 20 '20 at 09:12

2

There's a difference in procedure to calculate 'macro' and 'micro' averages.

As given in the documentation of f_score:

'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.

'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

In macro, the recall, precision and f1 for all classes are computed individually and then their mean is returned. So you cannot expect to apply your formula def f(p, r) on them. Because they are not the same thing as you intended.

In micro, the f1 is calculated on the final precision and recall (combined global for all classes). So that is matching the score that you calculate in my_f_micro.

Hope it makes sense.

For more explanation, you can read the answer here:-

How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit learn?

edited Jun 20 '20 at 09:12

Community

1
1

answered Apr 15 '17 at 03:37

Vivek Kumar

35,217
8
109
132

Yes, macro makes sense, but what about micro - can it only be used for boolean categoricals with one category to "recall" and be "precise" about? What's false positive and what's false negative for multiple classes? – Tomasz Gandor Oct 30 '20 at 21:24
@TomaszGandor Only difference is when the actual precision, recall and f1 is calculated. We first calculate tp, tn, fp, fn class wise. In micro, we sum all tp (for all classes), same for others and then calculate the metrics. In macro, we first calculate the metrics for each class, and then just take the average of them. Hope this makes it clear. – Vivek Kumar Dec 28 '20 at 09:24

Computing macro f1 score using sklearn

1 Answers1