
I have ratings for 60 cases by 3 raters. These are in lists organized by document: the first element is the rating of the first document, the second element the rating of the second document, and so on:

rater1 = [-8,-7,8,6,2,-5,...]
rater2 = [-3,-5,3,3,2,-2,...]
rater3 = [-4,-2,1,0,0,-2,...]

Is there a python implementation of Cohen's Kappa somewhere? I couldn't find anything in numpy or scipy, and nothing here on stackoverflow, but maybe I missed it? This is quite a common statistic, so I'm surprised I can't find it for a language like Python.

Zach
  • I agree that it would be good to rely on some commonly used library, but implementing it yourself is not hard. My straightforward implementation is under 50 lines of code and it includes handling of missing values. – varepsilon Apr 24 '16 at 09:29
  • 4
    Actually, given 3 raters cohen's kappa might not be appropriate. Since cohen's kappa measures agreement between two sample sets. For 3 raters, you would end up with 3 kappa values for '1 vs 2' , '2 vs 3' and '1 vs 3'. Which might not be easy to interpret – alvas Jan 31 '17 at 03:08
  • Fleiss' Kappa is the choice for 3 raters – Doc Brown Jul 17 '19 at 06:45

6 Answers


Cohen's kappa was introduced in scikit-learn 0.17:

sklearn.metrics.cohen_kappa_score(y1, y2, labels=None, weights=None)

Example:

from sklearn.metrics import cohen_kappa_score
labeler1 = [2, 0, 2, 2, 0, 1]
labeler2 = [0, 0, 2, 2, 0, 2]
cohen_kappa_score(labeler1, labeler2)

As a reminder, Cohen's kappa is defined as

κ = (p_o − p_e) / (1 − p_e)

where p_o is the observed agreement between the two raters and p_e is the agreement expected by chance.
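Since the ratings in the question are on an ordinal scale, the optional weights parameter ("linear" or "quadratic") may be more appropriate than the default unweighted kappa. A minimal sketch, reusing the labeler lists above:

from sklearn.metrics import cohen_kappa_score

labeler1 = [2, 0, 2, 2, 0, 1]
labeler2 = [0, 0, 2, 2, 0, 2]

# Quadratic weights penalize large disagreements more heavily than
# small ones, which often suits ordinal rating scales.
print(cohen_kappa_score(labeler1, labeler2, weights="quadratic"))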

Franck Dernoncourt

You can also use nltk.metrics.agreement. Below is a code snippet that computes several agreement statistics for three raters:

from nltk import agreement

rater1 = [1, 1, 1]
rater2 = [1, 1, 0]
rater3 = [0, 1, 1]

# Each entry is a (coder, item, label) triple.
taskdata = []
for coder, ratings in enumerate([rater1, rater2, rater3]):
    for item, label in enumerate(ratings):
        taskdata.append([coder, str(item), str(label)])

ratingtask = agreement.AnnotationTask(data=taskdata)
print("kappa " + str(ratingtask.kappa()))         # Cohen's kappa, averaged over rater pairs
print("fleiss " + str(ratingtask.multi_kappa()))  # Davies and Fleiss' multi-kappa
print("alpha " + str(ratingtask.alpha()))         # Krippendorff's alpha
print("scotts " + str(ratingtask.pi()))           # Scott's pi (multi-pi for >2 raters)

Also see http://courses.washington.edu/cmling/lab7.html for other examples.

oldmonk

To expand on Franck Dernoncourt's answer and address skjern's comment, here is the code to create a matrix of pairwise Cohen's kappa scores for more than two raters:

import itertools

from sklearn.metrics import cohen_kappa_score
import numpy as np

# Note that I updated the numbers so all Cohen kappa scores are different.
rater1 = [-8, -7, 8, 6, 2, -5]
rater2 = [-3, -5, 3, 3, 2, -2]
rater3 = [-4, -2, 1, 3, 0, -2]

raters = [rater1, rater2, rater3]

data = np.zeros((len(raters), len(raters)))
# Calculate cohen_kappa_score for every pair of raters.
# Kappa is symmetric, so only pairs with j < k are computed;
# the diagonal and lower triangle of the matrix stay zero.
for j, k in list(itertools.combinations(range(len(raters)), r=2)):
    data[j, k] = cohen_kappa_score(raters[j], raters[k])

# [[0.        , 0.11764706, 0.        ],
#  [0.        , 0.        , 0.25      ],
#  [0.        , 0.        , 0.        ]]

Here is a plot of data:

import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(
    data, 
    mask=np.tri(len(raters)),
    annot=True, linewidths=5,
    vmin=0, vmax=1,
    xticklabels=[f"Rater {k + 1}" for k in range(len(raters))],
    yticklabels=[f"Rater {k + 1}" for k in range(len(raters))],
)
plt.show()

(heatmap of the pairwise Cohen's kappa scores)

JulianWgs

Old question, but for reference: kappa can also be found in the skll metrics package.

http://skll.readthedocs.org/en/latest/api/metrics.html#skll.metrics.kappa
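A minimal sketch, assuming skll is installed and its kappa(y_true, y_pred) signature matches the documentation linked above:

from skll.metrics import kappa

rater1 = [1, 0, 2, 2, 0, 1]
rater2 = [0, 0, 2, 2, 0, 2]

# Unweighted kappa by default; the docs also describe a weights
# argument for linear/quadratic weighted variants.
print(kappa(rater1, rater2))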

mikkom

statsmodels is a Python library that includes Cohen's kappa and other inter-rater agreement metrics (in statsmodels.stats.inter_rater).
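A minimal sketch of the Cohen's kappa part; note that, as an assumption about the interface, cohens_kappa takes a square contingency table rather than the raw label lists:

import numpy as np
from statsmodels.stats.inter_rater import cohens_kappa

rater1 = [1, 0, 2, 2, 0, 1]
rater2 = [0, 0, 2, 2, 0, 2]

# Build the 3x3 contingency table (categories 0, 1, 2) from the raw labels.
table = np.zeros((3, 3))
for a, b in zip(rater1, rater2):
    table[a, b] += 1

result = cohens_kappa(table)
print(result.kappa)

The same module also provides fleiss_kappa for more than two raters.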

sophros
foobarbecue

I haven't found it included in any major libs, but if you google around you can find implementations on various "cookbook"-type sites and the like, including pages with implementations of Cohen's kappa, Fleiss' kappa, and Krippendorff's alpha.
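If you'd rather avoid a dependency entirely, the statistic is simple enough to write yourself. A minimal sketch of unweighted Cohen's kappa for two raters (not a hardened implementation; it assumes equal-length lists and some chance disagreement, i.e. p_e < 1):

def cohens_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length lists of labels."""
    n = len(a)
    labels = set(a) | set(b)
    # Observed agreement: fraction of items the two raters label identically.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: sum over labels of the product of each rater's
    # marginal label frequencies.
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

print(cohens_kappa([2, 0, 2, 2, 0, 1], [0, 0, 2, 2, 0, 2]))  # ~0.4286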

BrenBarn