Low interannotator agreement using krippendorff alpha or fleiss kappa

Question

I have 3 categories, rated by 3 annotators each. In 52% of the cases, the 3 annotators agreed on the same category and in 43% two annotators agreed on one category and in only 5% of the times, each annotator chose a different category.

I calculate fleiss's kappa or krippendorff, but the value for krippendorff is lower than the fleiss, much lower, it's 0.032 while my fleiss is 0.49.

Isn't the agreement too low, especially using krippendorff?

score 0 · Answer 1 · edited Nov 26 '21 at 19:15

Fleiß and Krippendorff implementations expect the input data to be in specific formats (rows, columns) !

Fleiss (subjects, n_categories)

Krippendorff (raters, subjects)

To get there from (subjects, raters)

For Fleiss use aggregate_raters() function from statsmodels fleiss

For Krippendorff transpose the array

If used correctly these functions will result in very similar values. If not make sure Krippendorff ‘knows’ what kind of scale (nominal, ordinal.. etc.) it is dealing with by passing the appropriate argument.

Also see the longer answers:

Inter-rater reliability calculation for multi-raters data

Is fleiss kappa a reliable measure for interannotator agreement? The following results confuses me, are there any involved assumptions while using it?

Low interannotator agreement using krippendorff alpha or fleiss kappa

1 Answers1