Apply calculation to dataframe grouped by categorical variable

Question

Probably a duplicate, but I've been unable to find a simple instance of this question.

I have a dataframe, DF:

     Event ID Objective.Bi Subjective.Bi Confidence   Outcome Conf.Bin
1         1            0             0         80   Correct    80-89
2         2            0             1         50 Incorrect    50-59
3         3            0             1         60 Incorrect    60-69
4         4           NA             0         80      <NA>    80-89
5         5            0             1         30 Incorrect    30-39
6         6            0             0         60   Correct    60-69
7         7            1             0         80 Incorrect    80-89
8         8            0             0         10   Correct    10-19
9         9            1             0         10 Incorrect    10-19
10       10            0             0         50   Correct    50-59
11       11            1             1         90   Correct   90-100
12       12            0             1         50 Incorrect    50-59
13       13            1             0         80 Incorrect    80-89
14       14            0             0         50   Correct    50-59
15       15            1             1         10   Correct    10-19
16       16            1             1         20   Correct    20-29
17       17            1             0         80 Incorrect    80-89
18       18            1             1         50   Correct    50-59
19       19            1             1         20   Correct    20-29
20       20            1             1         99   Correct   90-100
21       21            1             0         90 Incorrect   90-100
22       22            0             0         90   Correct   90-100
23       23           NA             1         10      <NA>    10-19
24       24            1             0         20 Incorrect    20-29
25       25            0             0         80   Correct    80-89
26       26            0             0         80   Correct    80-89
27       27            0             0         50   Correct    50-59
28       28            0             0         50   Correct    50-59
29       29           NA             1         60      <NA>    60-69
30       30            1             1         70   Correct    70-79

I want to group the data by the Conf.Bin variable, and then calculate the proportion of Correct Outcome values in each group (i.e., %.Correct = number of correct outcomes in group / number of observations in group). For example, my desired output would look like this:

   Conf.Bin  %.Correct
1     10-19       50.0
2     20-29       66.7
3     30-39       00.0
...

What's the simplest way to go about this? I've used group_by from dplyr in the past but am unsure how to apply this manual calculation to each group to produce the desired outcome.

How about [Relative frequencies / proportions with dplyr](https://stackoverflow.com/questions/24576515/relative-frequencies-proportions-with-dplyr) ? — deepseefan, Oct 30 '19 at 16:48
Thanks @deepseefan, I was able to sort it out with the post you linked to. I'll post an answer shortly. — user72716, Oct 30 '19 at 17:03

score 0 · Accepted Answer · answered Oct 30 '19 at 17:12

I was able to sort this out by adapting the code from this previous post: [Relative frequencies / proportions with dplyr](https://stackoverflow.com/questions/24576515/relative-frequencies-proportions-with-dplyr)

This use of dplyr generates a dataframe with relative frequencies for each Outcome in each group of Conf.Bin:

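library(dplyr)

# Count each Outcome within each Conf.Bin, then convert the counts to within-bin proportions
DF.Correct <- DF %>%
  group_by(Conf.Bin, Outcome) %>%
  summarise(n = n()) %>%          # result is still grouped by Conf.Bin
  mutate(freq = n / sum(n)) %>%   # so sum(n) is the total for each Conf.Bin
  as.data.frame()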

head(DF.Correct)
#  Conf.Bin   Outcome n      freq
#1    10-19      <NA> 1 0.2500000
#2    10-19   Correct 2 0.5000000
#3    10-19 Incorrect 1 0.2500000
#4    20-29   Correct 2 0.6666667
#5    20-29 Incorrect 1 0.3333333
#6    30-39 Incorrect 1 1.0000000

Since I'm only interested in the frequency of Correct Outcome values in each group, I simply subset DF.Correct:

DF.Correct <- filter(DF.Correct, Outcome == "Correct")

head(DF.Correct)
#  Conf.Bin Outcome n      freq
#1    10-19 Correct 2 0.5000000
#2    20-29 Correct 2 0.6666667
#3    50-59 Correct 5 0.7142857
#4    60-69 Correct 1 0.3333333
#5    70-79 Correct 1 1.0000000
#6    80-89 Correct 3 0.4285714
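One thing to watch with the filter step: a bin with no Correct rows (30-39 in the data above) drops out of the result entirely, whereas the desired output in the question lists it as 0. A minimal sketch of a one-step alternative that keeps such bins is below. It assumes only `Outcome` and `Conf.Bin` matter, uses a subset of the question's rows for brevity, and `DF.Pct` and `pct_correct` are hypothetical names chosen for illustration.

```r
library(dplyr)

# Subset of the question's rows (bins 10-19, 20-29 and 30-39 only), for illustration
DF <- data.frame(
  Outcome  = c("Incorrect", "Correct", "Incorrect", "Correct",
               "Correct", "Correct", NA, "Incorrect"),
  Conf.Bin = c("30-39", "10-19", "10-19", "10-19",
               "20-29", "20-29", "10-19", "20-29")
)

# Percentage of Correct outcomes per bin; NA outcomes still count in the
# denominator because n() counts every row in the group
DF.Pct <- DF %>%
  group_by(Conf.Bin) %>%
  summarise(pct_correct = 100 * sum(Outcome == "Correct", na.rm = TRUE) / n()) %>%
  as.data.frame()

DF.Pct
```

Because the summary is computed once per `Conf.Bin` rather than per `Conf.Bin`/`Outcome` pair, every bin gets a row, including those where the count of Correct outcomes is zero.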

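NOTE: Observations with an NA Outcome are counted in each group's total here, so they lower the relative frequency of Correct in any bin that contains them (e.g., 10-19 is 2/4 rather than 2/3).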
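If NA outcomes should instead be left out of the denominator, one possible variation is to take the mean of the logical comparison with `na.rm = TRUE`. This is only a sketch under that assumption, again on a subset of the question's rows; `DF.NoNA` and `prop_correct` are hypothetical names.

```r
library(dplyr)

# Subset of the question's rows (bins 10-19, 20-29 and 30-39 only), for illustration
DF <- data.frame(
  Outcome  = c("Incorrect", "Correct", "Incorrect", "Correct",
               "Correct", "Correct", NA, "Incorrect"),
  Conf.Bin = c("30-39", "10-19", "10-19", "10-19",
               "20-29", "20-29", "10-19", "20-29")
)

# Proportion of Correct among non-missing outcomes in each bin:
# Outcome == "Correct" is NA for missing outcomes, and na.rm = TRUE drops them
DF.NoNA <- DF %>%
  group_by(Conf.Bin) %>%
  summarise(prop_correct = mean(Outcome == "Correct", na.rm = TRUE)) %>%
  as.data.frame()

DF.NoNA
```

With this version, bin 10-19 comes out as 2/3 instead of 2/4. A bin whose outcomes are all NA would yield NaN, so that case may need separate handling.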
