
I have ratings for 60 cases by 3 raters. These are in lists organized by document: the first element is the rating of the first document, the second element the rating of the second document, and so on:

rater1 = [-8,-7,8,6,2,-5,...]
rater2 = [-3,-5,3,3,2,-2,...]
rater3 = [-4,-2,1,0,0,-2,...]

Is there a python implementation of Cohen's Kappa somewhere? I couldn't find anything in numpy or scipy, and nothing here on stackoverflow, but maybe I missed it? This is quite a common statistic, so I'm surprised I can't find it for a language like Python.

Zach
  • I agree that it would be good to rely on some commonly used library, but implementing it yourself is not hard. My straightforward implementation is under 50 lines of code and it includes handling of missing values. – varepsilon Apr 24 '16 at 09:29
  • 4
    Actually, given 3 raters cohen's kappa might not be appropriate. Since cohen's kappa measures agreement between two sample sets. For 3 raters, you would end up with 3 kappa values for '1 vs 2' , '2 vs 3' and '1 vs 3'. Which might not be easy to interpret – alvas Jan 31 '17 at 03:08
  • Fleiss' Kappa is the choice for 3 raters – Doc Brown Jul 17 '19 at 06:45

6 Answers


Cohen's kappa was introduced in scikit-learn 0.17:

sklearn.metrics.cohen_kappa_score(y1, y2, labels=None, weights=None)

Example:

from sklearn.metrics import cohen_kappa_score
labeler1 = [2, 0, 2, 2, 0, 1]
labeler2 = [0, 0, 2, 2, 0, 2]
cohen_kappa_score(labeler1, labeler2)

As a reminder, Cohen's kappa is defined as

κ = (p_o − p_e) / (1 − p_e)

where p_o is the observed agreement between the two raters and p_e is the agreement expected by chance.
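Since the ratings in the question are on an ordinal scale, the optional weights parameter ("linear" or "quadratic") may be more appropriate than the default unweighted kappa. A minimal sketch, reusing the labeler lists above:

from sklearn.metrics import cohen_kappa_score

labeler1 = [2, 0, 2, 2, 0, 1]
labeler2 = [0, 0, 2, 2, 0, 2]

# Quadratic weights penalize large disagreements more heavily than
# small ones, which often suits ordinal rating scales.
print(cohen_kappa_score(labeler1, labeler2, weights="quadratic"))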

Franck Dernoncourt

You can also use nltk.metrics.agreement. Below is a code snippet that computes several agreement statistics for three raters:

from nltk import agreement

rater1 = [1, 1, 1]
rater2 = [1, 1, 0]
rater3 = [0, 1, 1]

# Each entry is a (coder, item, label) triple.
taskdata = []
for coder, ratings in enumerate([rater1, rater2, rater3]):
    for item, label in enumerate(ratings):
        taskdata.append([coder, str(item), str(label)])

ratingtask = agreement.AnnotationTask(data=taskdata)
print("kappa " + str(ratingtask.kappa()))         # Cohen's kappa, averaged over rater pairs
print("fleiss " + str(ratingtask.multi_kappa()))  # Davies and Fleiss' multi-kappa
print("alpha " + str(ratingtask.alpha()))         # Krippendorff's alpha
print("scotts " + str(ratingtask.pi()))           # Scott's pi (multi-pi for >2 raters)

Also see http://courses.washington.edu/cmling/lab7.html for other examples.

oldmonk

To expand on Franck Dernoncourt's answer and address skjern's comment, here is the code to create a matrix of pairwise Cohen's kappa scores for more than two raters:

import itertools

from sklearn.metrics import cohen_kappa_score
import numpy as np

# Note that I updated the numbers so all Cohen kappa scores are different.
rater1 = [-8, -7, 8, 6, 2, -5]
rater2 = [-3, -5, 3, 3, 2, -2]
rater3 = [-4, -2, 1, 3, 0, -2]

raters = [rater1, rater2, rater3]

data = np.zeros((len(raters), len(raters)))
# Calculate cohen_kappa_score for every pair of raters.
# Kappa is symmetric, so only pairs with j < k are computed;
# the diagonal and lower triangle of the matrix stay zero.
for j, k in list(itertools.combinations(range(len(raters)), r=2)):
    data[j, k] = cohen_kappa_score(raters[j], raters[k])

# [[0.        , 0.11764706, 0.        ],
#  [0.        , 0.        , 0.25      ],
#  [0.        , 0.        , 0.        ]]

Here is a plot of data:

import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(
    data, 
    mask=np.tri(len(raters)),
    annot=True, linewidths=5,
    vmin=0, vmax=1,
    xticklabels=[f"Rater {k + 1}" for k in range(len(raters))],
    yticklabels=[f"Rater {k + 1}" for k in range(len(raters))],
)
plt.show()

(heatmap of the pairwise Cohen's kappa scores)

JulianWgs

Old question, but for reference: kappa can also be found in the skll metrics package.

http://skll.readthedocs.org/en/latest/api/metrics.html#skll.metrics.kappa
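A minimal sketch, assuming skll is installed and its kappa(y_true, y_pred) signature matches the documentation linked above:

from skll.metrics import kappa

rater1 = [1, 0, 2, 2, 0, 1]
rater2 = [0, 0, 2, 2, 0, 2]

# Unweighted kappa by default; the docs also describe a weights
# argument for linear/quadratic weighted variants.
print(kappa(rater1, rater2))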

mikkom

statsmodels is a Python library that includes Cohen's kappa and other inter-rater agreement metrics (in statsmodels.stats.inter_rater).
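A minimal sketch of the Cohen's kappa part; note that, as an assumption about the interface, cohens_kappa takes a square contingency table rather than the raw label lists:

import numpy as np
from statsmodels.stats.inter_rater import cohens_kappa

rater1 = [1, 0, 2, 2, 0, 1]
rater2 = [0, 0, 2, 2, 0, 2]

# Build the 3x3 contingency table (categories 0, 1, 2) from the raw labels.
table = np.zeros((3, 3))
for a, b in zip(rater1, rater2):
    table[a, b] += 1

result = cohens_kappa(table)
print(result.kappa)

The same module also provides fleiss_kappa for more than two raters.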

sophros
foobarbecue

I haven't found it included in any major libs, but if you google around you can find implementations on various "cookbook"-type sites and the like, including pages with implementations of Cohen's kappa, Fleiss' kappa, and Krippendorff's alpha.
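If you'd rather avoid a dependency entirely, the statistic is simple enough to write yourself. A minimal sketch of unweighted Cohen's kappa for two raters (not a hardened implementation; it assumes equal-length lists and some chance disagreement, i.e. p_e < 1):

def cohens_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length lists of labels."""
    n = len(a)
    labels = set(a) | set(b)
    # Observed agreement: fraction of items the two raters label identically.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: sum over labels of the product of each rater's
    # marginal label frequencies.
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

print(cohens_kappa([2, 0, 2, 2, 0, 1], [0, 0, 2, 2, 0, 2]))  # ~0.4286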

BrenBarn