Is fleiss kappa a reliable measure for interannotator agreement? The following results confuses me, are there any involved assumptions while using it?

Question

I have annotation matrix with following description: 3 Annotators, 3 categories, 206 subjects

The data is stored in a numpy.ndarray variable z:

array([[ 0.,  2.,  1.],
   [ 0.,  2.,  1.],
   [ 0.,  2.,  1.],
   [ 0.,  2.,  1.],
   [ 1.,  1.,  1.],
   [ 0.,  2.,  1.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.],
   [ 0.,  3.,  0.]])

As can be seen 200 out of 206 annotations are for the same categories by all three annotators. Now implementing the Fleiss Kappa:

from statsmodels.stats.inter_rater import fleiss_kappa
fleiss_kappa(z)
0.062106000466964177

Why is the score so low in spite majority subjects (200/206) are annotated for the same category?

DieseRobin · Answer 1 · 2021-11-26T15:25:10.527

I am using 'rater' instead of 'annotator'. Remember these are measures of agreement above chance between raters.

Fleiss' kappa

For Fleiss kappa you need to aggregate_raters() first: this takes you from subjects as rows and raters as columns (subjects, raters) to -> subjects as rows and categories as columns (subjects, categories)

from statsmodels.stats import inter_rater as irr
agg = irr.aggregate_raters(arr) # returns a tuple (data, categories)
agg

Each row values will add up to number of raters (3), if every rater assigned one category per subject. Now the columns represent the categories as seen here https://en.wikipedia.org/wiki/Fleiss'_kappa#Data

(array([[1, 1, 1, 0],   # all three raters disagree
        [1, 1, 1, 0],   # again
        [1, 1, 1, 0],   # and again
        [1, 1, 1, 0],   # and again
        [0, 3, 0, 0],   # all three raters agree on 1
        [1, 1, 1, 0],   
        [2, 0, 0, 1],   # two raters agree on 0, third picked 3
        [2, 0, 0, 1],   # perfect disagreement 
        [2, 0, 0, 1],   # for the rest of the dataset.
        [2, 0, 0, 1],  . . . ),           
 array([0, 1, 2, 3]))   # categories

perfect disagreement: 'every time I choose 0, you choose 3'

…your data says that you have 4 categories [0, 1, 2, 3]

For the first 4 subjects each rater coded a different category! Then for the remaining subjects rater one and three agree on category 0 while rater two rated 3s. Now this is perfect disagreement for most of the dataset so I would not be surprised to see a negative alpha or kappa! Lets have a look... we only need the aggregated data agg[0] (first part of the tuple).

irr.fleiss_kappa(agg[0], method='fleiss')

-0.44238 …which makes sense given the disagreement on most subjects

Krippendorff alpha

The current krippendorff implementation expects raters as rows and subjects as columns (raters, subjects). So we need to transpose the original data. If we do not do this then 206 raters are assumed to have rated 3 subjects with four categories [0,1,2,3] resulting in the answer given previously (0.98). Krippendorff does not expect the aggregated format!

import numpy as np
import krippendorff as kd 
arrT = np.array(arr).transpose()  #returns a list of three lists, one per rater
kd.alpha(arrT, level_of_measurement='nominal')  #assuming nominal categories

-0.4400 …which makes sense, because it should be close to / equal to Fleiss' kappa.

score 0 · Answer 2 · answered Oct 04 '19 at 21:18

This solution may serve your purpose. Here's code snippet for you which yields kappa score 0.98708.

import krippendorff

arr = [[ 0,  2,  1],
   [ 0,  2,  1],
   [ 0,  2,  1],
   [ 0,  2,  1],
   [ 1,  1,  1],
   [ 0,  2,  1],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0],
   [ 0,  3,  0]]

kappa = krippendorff.alpha(arr)
print(kappa)

It works with Python 3.4+ and here are the dependencies you need to install

pip install numpy krippendorff

score 0 · Answer 3 · answered Feb 25 '21 at 14:30

0

I also want to calculate Fleiss's kappa or krippendorff. ut, the value for krippendorff is lower than the Fleiss, much lower, it's 0.032 while my fleiss is 0.49.

I have 3 categories, rated by 3 annotators each. In 52% of the cases, the 3 annotators agreed on the same category and in 43% two annotators agreed on one category and in only 5% of the times, each annotator chose a different category. Isn't the agreement too low, especially using krippendorff?

answered Feb 25 '21 at 14:30

user3636159

1
1

This is not an answer to the original question. StackOverflow questions are not forum topics. Please post individual questions. – Sid Kwakkel Feb 25 '21 at 17:01

MoJo494 · Answer 4 · 2023-01-12T10:22:14.277

I think the statsmodels score is perfectly fine. The problem of your example is that the second category is picked almost all the time. This implies by definition of Fleiss Kappa that two random raters both pick the second category by chance is very high. Mathematically, following the notation of the paper of the wikipedia article (which exactly matches the paper), Fleiss Kappa is defined as:

k=(\bar{P}-\bar{P_e})/(1-\bar{P_e})

where

\bar{P} is the actual probability that the rating of two random raters agree
bar{P_e} is the probability that the rating of two random raters agree by chance

In your case, \bar{P} and (and thats the problem) \bar{P_e} are close to 1.

A solution to your problem would be that the raters agree also on the other two categories. So for example change your example in a way that you have 306 subjects, with still 3 categories and 3 raters. Lets assume that the annotations of the first 6 subjects are similar to your example. Then for the next 100 subjects all 3 raters agree on category 1. For the next 100 subjects all 3 raters agree on category 2. Accordingly all raters agree on category 3 for the last 100 subjects. Now the probability that two raters end up with the same rating by chance is much lower, since the overall number of ratings per category is much more balanced. In this exact example, the Fleiss Kappa is 0.9787

from statsmodels.stats.inter_rater import fleiss_kappa
import numpy as np


firstSixAnnotations=np.array([[ 0.,  2.,  1.],
   [ 0.,  2.,  1.],
   [ 0.,  2.,  1.],
   [ 0.,  2.,  1.],
   [ 1.,  1.,  1.],
   [ 0.,  2.,  1.]])
allRatersAgreeOnFirstCategory=np.tile(np.array([3.,0.,0.]), (100,1))
allRatersAgreeOnSecondCategory=np.tile(np.array([0.,3.,0.]), (100,1))
allRatersAgreeOnThirdCategory=np.tile(np.array([0.,0.,3.]), (100,1))
z=np.concatenate([firstSixAnnotations,allRatersAgreeOnFirstCategory,allRatersAgreeOnSecondCategory,allRatersAgreeOnThirdCategory])
print(fleiss_kappa(z))

Is fleiss kappa a reliable measure for interannotator agreement? The following results confuses me, are there any involved assumptions while using it?

4 Answers4

Fleiss' kappa

…your data says that you have 4 categories [0, 1, 2, 3]

Krippendorff alpha

Linked