
I am working with a dataset that contains, for each id, a predicted label (predicted), a true label (label), and a column indicating whether the two are equal (match). For each label, I want to show the share of correct predictions out of the total number of observations belonging to that label.

As an example, given the following data:

id <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
label <- c(6, 5, 1, 5, 4, 2, 3, 1, 6, 1)
predicted <- c(6, 5, 1, 3, 2, 2, 3, 1, 4, 4)
match <- c(1, 1, 1, 0, 0, 1, 1, 1, 0, 0)
dt <- data.frame(id, label, predicted, match)
head(dt)
  id label predicted match
1  1     6         6     1
2  2     5         5     1
3  3     1         1     1
4  4     5         3     0
5  5     4         2     0
6  6     2         2     1

My attempt uses group_by(label), count(label, predicted), and then mutate(percent = sum(match == 1)/sum(n)). I expected it to return a data frame with one percentage per label, like this:

library(plyr)
library(dplyr)
dt %>% group_by(label) %>% dplyr::count(label, predicted) %>% mutate(percent = sum(match == 1)/sum(n))

dt
   id label predicted match percent
1   3     1         1     1    0.67
2   8     1         1     1    0.67
3  10     1         4     0    0.67
4   6     2         2     1    1.00
5   7     3         3     1    1.00
6   5     4         2     0    0.00
7   4     5         3     0    0.50
8   2     5         5     1    0.50
9   9     6         4     0    0.50
10  1     6         6     1    0.50

However, my code gives me the following output instead:

dt
# A tibble: 6 x 4
# Groups:   label [5]
  label predicted     n percent
  <dbl>     <dbl> <int>   <dbl>
1  1.00      1.00     2   0.600
2  1.00      4.00     1   0.600
3  2.00      2.00     1   0.600
4  3.00      3.00     1   0.600
5  4.00      2.00     1   0.600
6  5.00      3.00     1   0.600

It calculated the percentage of correct predictions across all labels (hence every row shows 0.600) instead of computing it separately for each label. How should I modify my code to get the desired output?

Chris T.

1 Answer


I wasn't able to reproduce your output with the code that you shared. I think the following will accomplish what you are seeking, though (I used total as the variable name rather than n):

dt %>% 
  arrange(label) %>% 
  group_by(label) %>% 
  mutate(total = n(), 
         percent = sum(match == 1) / total)
# A tibble: 10 x 6
# Groups:   label [6]
      id label predicted match total percent
   <dbl> <dbl>     <dbl> <dbl> <int>   <dbl>
 1     3     1         1     1     3   0.667
 2     8     1         1     1     3   0.667
 3    10     1         4     0     3   0.667
 4     6     2         2     1     1   1    
 5     7     3         3     1     1   1    
 6     5     4         2     0     1   0    
 7     2     5         5     1     2   0.5  
 8     4     5         3     0     2   0.5  
 9     1     6         6     1     2   0.5  
10     9     6         4     0     2   0.5 
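
If one row per label is enough, a summarised variant may be easier to read. The following is only a sketch: it assumes a fresh session in which dplyr is attached and plyr is not, and that match is coded 0/1 as in the question. The name accuracy_by_label is made up for illustration.

```r
library(dplyr)

# Example data copied from the question
dt <- data.frame(
  id        = 1:10,
  label     = c(6, 5, 1, 5, 4, 2, 3, 1, 6, 1),
  predicted = c(6, 5, 1, 3, 2, 2, 3, 1, 4, 4),
  match     = c(1, 1, 1, 0, 0, 1, 1, 1, 0, 0)
)

# One row per label: the group size and the share of correct predictions.
# mean(match == 1) is the same quantity as sum(match == 1) / n().
accuracy_by_label <- dt %>%
  group_by(label) %>%
  summarise(total = n(), percent = mean(match == 1))

print(accuracy_by_label)
```

Because summarise() collapses each group to a single row, the id, predicted, and match columns are dropped. Keep the mutate() version above if the per-row layout from the question is needed.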
Will Oldham
  • Hi, thanks for your reply, I received an error message `Error: This function should not be called directly` when I tried to replicate your code. – Chris T. May 16 '19 at 19:44
  • It turns out that I need to add `dplyr::` before `mutate` as this is the only way to work on my Mac. Thanks so much for the help! – Chris T. May 16 '19 at 19:55
  • Try in fresh session without loading plyr; you only need dplyr. I think that may be the problem. I usually will just load the tidyverse. Also more info [here](https://stackoverflow.com/questions/22801153/dplyr-error-in-n-function-should-not-be-called-directly) in case you have other dependencies/requirements. – Will Oldham May 16 '19 at 19:58
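
The error discussed in these comments is consistent with plyr's verbs masking dplyr's, which can happen when plyr is attached after dplyr. That cause is an assumption, since it could not be confirmed from the question. One way to sidestep it is to qualify every call with dplyr::, so the result does not depend on which packages are attached. A minimal sketch, with grouped and result as illustrative names:

```r
# Example data copied from the question
dt <- data.frame(
  id        = 1:10,
  label     = c(6, 5, 1, 5, 4, 2, 3, 1, 6, 1),
  predicted = c(6, 5, 1, 3, 2, 2, 3, 1, 4, 4),
  match     = c(1, 1, 1, 0, 0, 1, 1, 1, 0, 0)
)

# Fully qualified calls resolve to dplyr regardless of attach order,
# so an attached plyr cannot intercept mutate() or n().
grouped <- dplyr::group_by(dplyr::arrange(dt, label), label)
result <- dplyr::mutate(
  grouped,
  total   = dplyr::n(),
  percent = sum(match == 1) / total
)

print(result)
```

This version avoids library() and the pipe entirely, which makes it verbose but independent of the search path. In everyday use, attaching only dplyr in a fresh session, as suggested in the comment above, is usually simpler.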