Building off of this question here: Inter-rater reliability calculation for multi-raters data
Let's say you have N units of text distributed amongst M > 2 annotators. Not all units of text are annotated by all M annotators, i.e. there are missing annotations.
Suppose each unit of text can be assigned to one of K categories.
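For concreteness, here is a toy version of the data I have in mind (this is just a sketch; the raters-by-units layout and coding missing annotations as NaN are my own assumptions about how to represent it):

```python
import numpy as np

# Toy example: N = 6 units of text, M = 3 annotators, K = 3 categories coded 0, 1, 2.
# Rows = annotators, columns = units. np.nan marks a missing annotation
# (that annotator did not label that unit).
annotations = np.array([
    [0,      1,      2,      np.nan, 1,      0     ],
    [0,      np.nan, 2,      1,      1,      0     ],
    [np.nan, 1,      2,      1,      2,      np.nan],
])
```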
Yes, we could use Krippendorff's alpha to measure inter-annotator agreement across all units of text and all categories at once. But what if you wanted to know the reliability of annotation for each category separately? The point is to identify which categories have the lowest reliability. How could we determine that in this situation?
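For reference, this is roughly the overall calculation I have in mind, as a minimal sketch assuming the `krippendorff` Python package and the toy matrix above; it is the per-category breakdown of this number that I don't know how to get:

```python
import krippendorff  # assumed: the `krippendorff` package from PyPI

# Overall nominal agreement across all units and all K categories at once.
overall_alpha = krippendorff.alpha(
    reliability_data=annotations,    # annotators x units, np.nan = missing
    level_of_measurement="nominal",
)
print(f"Overall Krippendorff's alpha: {overall_alpha:.3f}")
```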