I found that the irr package has two big bugs in the calculation of weighted kappa.
Please tell me whether these two bugs are really there or I have misunderstood something. You can reproduce them with the following examples.
First bug: the sorting of labels in the confusion matrix needs to be corrected.
I have two sets of scores for disease extent (ranging from 0 to 100, where 0 is healthy and 100 is extremely ill).
In label_test.csv (you can copy and paste the data to your disk to run the following test):
0
1
1
1
0
14
53
3
In pred_test.csv:
0
1
1
0
3
4
54
6
In script_r.R:
library(irr)
label <- read.csv('label_test.csv', header=FALSE)
pred <- read.csv('pred_test.csv', header=FALSE)
kapp <- kappa2(data.frame(label, pred), "unweighted")   # unweighted Cohen's kappa
kappa <- getElement(kapp, "value")
print(kappa) # output: 0.245283
w_kapp <- kappa2(data.frame(label, pred), "equal")      # weighted kappa with equal (linear) weights
weighted_kappa <- getElement(w_kapp, "value")
print(weighted_kappa) # output: 0.443038
When I use Python to calculate kappa and weighted_kappa, in script_python.py:
from sklearn.metrics import cohen_kappa_score
import numpy as np
import pandas as pd

label = pd.read_csv('label_test.csv', header=None).to_numpy()
pred = pd.read_csv('pred_test.csv', header=None).to_numpy()
kappa = cohen_kappa_score(label.astype(int), pred.astype(int))
print(kappa) # output: 0.24528301886792447
weighted_kappa = cohen_kappa_score(label.astype(int), pred.astype(int), weights='linear', labels=np.array(list(range(100))))
print(weighted_kappa) # output: 0.8359908883826879
We can see that the kappa calculated by R and Python is the same, but the weighted_kappa from R is far lower than the weighted_kappa from sklearn in Python. Which one is wrong? After two days of research, I found that the weighted_kappa from the irr package in R is wrong. Details are as follows.
During debugging, we can see that the confusion matrix built inside irr orders its labels as [0, 1, 14, 3, 4, 53, 54, 6]. This order is wrong; it should be [0, 1, 3, 4, 6, 14, 53, 54], as it is in Python. It seems that the irr package uses a string-based sort instead of an integer-based sort, which puts 14 in front of 3. This mistake could and should be corrected easily.
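A minimal sketch of the sorting difference in plain R (the variable name lev is mine; this only illustrates the suspected cause, not irr's actual internals):

# the labels that actually occur in label_test.csv / pred_test.csv
lev <- c(0, 1, 3, 4, 6, 14, 53, 54)
sort(as.character(lev))  # string-based sort: "0" "1" "14" "3" "4" "53" "54" "6"
sort(lev)                # integer-based sort: 0 1 3 4 6 14 53 54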
Second bug: the confusion matrix in R is not complete.
In my pred_test.csv and label_test.csv, the values do not cover all possible values from 0 to 100, so the default confusion matrix in irr misses the values that do not appear in the data. This should be fixed.
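Here is a small base-R sketch of how the chosen label set changes the size of the confusion matrix (the 0:99 range mirrors the labels=np.array(list(range(100))) call in script_python.py; the variable names are mine):

obs <- c(0, 1, 3, 4, 6, 14, 53, 54)  # values observed in the data
# levels taken only from observed values -> 8 x 8 table (the behaviour described above for irr)
dim(table(factor(obs, levels = sort(unique(obs))),
          factor(obs, levels = sort(unique(obs)))))  # 8 8
# levels covering the full scale -> 100 x 100 table (what sklearn builds when labels= is passed)
dim(table(factor(obs, levels = 0:99),
          factor(obs, levels = 0:99)))               # 100 100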
Let's see another example. In pred_test.csv, change the value 54 to 99. Then run script_r.R and script_python.py again. The results are:
In R:
kappa: 0.245283
weighted_kappa: 0.443038
In Python:
kappa: 0.24528301886792447
weighted_kappa: 0.592891760904685
We can see that the weighted_kappa from irr in R does not change at all, while the weighted_kappa from sklearn in Python drops from 0.83 to 0.59. So irr has made a mistake again.
The reason is that sklearn lets us pass the full set of labels when building the confusion matrix, so that its shape becomes 100 * 100; in irr, however, the labels of the confusion matrix are derived only from the unique values found in label and pred, which misses many of the other possible values. As a result, the disagreement between 53 and 99 here receives the same weight as the disagreement between 53 and 54, because both pairs end up as adjacent categories in the reduced matrix. So it would be better if the irr package offered an option that lets the user supply custom labels, as sklearn does in Python.
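Until such an option exists, a possible workaround is to compute the linear-weighted kappa by hand over an explicit, full label set. The function below is only a sketch under that assumption (the name weighted_kappa_full and the helper code are mine, not part of irr); it uses the standard weighted-kappa formula with linear agreement weights.

weighted_kappa_full <- function(r1, r2, levels) {
  r1 <- factor(r1, levels = levels)
  r2 <- factor(r2, levels = levels)
  cm <- table(r1, r2)                                   # confusion matrix over the full label set
  n  <- sum(cm)
  k  <- length(levels)
  w  <- 1 - abs(outer(seq_len(k), seq_len(k), "-")) / (k - 1)   # linear agreement weights
  po <- sum(w * cm) / n                                 # observed weighted agreement
  pe <- sum(w * outer(rowSums(cm), colSums(cm))) / n^2  # expected weighted agreement
  (po - pe) / (1 - pe)
}

label <- read.csv('label_test.csv', header = FALSE)[[1]]
pred  <- read.csv('pred_test.csv',  header = FALSE)[[1]]
weighted_kappa_full(label, pred, levels = 0:99)  # 0:99 mirrors labels=range(100); should be comparable to the sklearn value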