I am working on a dataset that contains predicted label (predicted
) vs. true label (label
) for each id
and a column indicating whether the predicted label equals true label (match
). I want to show the percentage of correct prediction for each label
versus the total number of observations belonging to that label.
As an example, given the following data:
id <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
label <- c(6, 5, 1, 5, 4, 2, 3, 1, 6, 1)
predicted <- c(6, 5, 1, 3, 2, 2, 3, 1, 4, 4)
match <- c(1, 1, 1, 0, 0, 1, 1, 1, 0, 0)
dt <- data.frame(id, label, predicted, match)
head(dt)
id label predicted match
1 1 6 6 1
2 2 5 5 1
3 3 1 1 1
4 4 5 3 0
5 5 4 2 0
6 6 2 2 1
If I group_by(label)
and count(label, predicted)
and then mutate(percent = sum(match == 1)/sum(n))
, it is expected that I should obtain a new grouped data frame like this
library(plyr)
library(dplyr)
dt %>% group_by(label) %>% dplyr::count(label, predicted) %>% mutate(percent = sum(match == 1)/sum(n))
dt
id label predicted match percent
1 3 1 1 1 0.67
2 8 1 1 1 0.67
3 10 1 4 0 0.67
4 6 2 2 1 1.00
5 7 3 3 1 1.00
6 5 4 2 0 0.00
7 4 5 3 0 0.50
8 2 5 5 1 0.50
9 9 6 4 0 0.50
10 1 6 6 1 0.50
However, my code gives me this following output instead
dt
# A tibble: 6 x 4
# Groups: label [5]
label predicted n percent
<dbl> <dbl> <int> <dbl>
1 1.00 1.00 2 0.600
2 1.00 4.00 1 0.600
3 2.00 2.00 1 0.600
4 3.00 3.00 1 0.600
5 4.00 2.00 1 0.600
6 5.00 3.00 1 0.600
It calculated the percentage of correct prediction for "all" label
(hence, all equals 0.600) instead of doing that for each label
. How should I modify my code to achieve my desired output?