Summarising Data in R

Question

I'm sure this is really easy, but as an R novice I'm tearing my hair out.

I have a data frame:

df <- data.frame("Factor_1" = c(1,2,1,1,2,1,1,2,1,2,1,2),
                 "Factor_2" = c("M", "F", "M", "F","M", "F","M", "F","M", "F","M", "F"),
                 "Denominator" = c(1,1,1,1,1,1,1,1,1,1,1,1),
                 "Numerator" = c(0,0,1,0,0,0,1,0,0,0,1,1))

I want to create some graphs:

(1) Sum(Denominator) - split by Factor_1
(2) Sum(Numerator)/Sum(Denominator) - split by Factor_1

(so Factor_1 appears on the horizontal axis)

(and then repeat for Factor_2)

Ideally with (1) and (2) having different vertical axes and with (1) as columns and (2) as lines.

Would look a bit like the attached picture (from an Excel pivot table/graph):

Pivot Graph

It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input that can be used to test and verify possible solutions. Show what you tried so far and describe exactly where you are getting stuck. — MrFlick, Nov 29 '18 at 16:12
Have edited original post to show an example of a small chunk of data. I've tried various things - plot, boxplot, ggplot, etc. but I'm afraid I just don't know what to do. I can't imagine this is a difficult thing to do though. — Alan, Nov 29 '18 at 16:32

score 0 · Answer 1 · answered Nov 29 '18 at 16:42

library(tidyverse)

df <- data.frame("Factor_1" = c(1,2,1,1,2,1,1,2,1,2,1,2),
             "Factor_2" = c("M", "F", "M", "F","M", "F","M", "F","M", "F","M", "F"),
             "Denominator" = c(1,1,1,1,1,1,1,1,1,1,1,1),
             "Numerator" = c(0,0,1,0,0,0,1,0,0,0,1,1))




df %>% group_by(Factor_1) %>% summarize(sum_num=sum(Numerator),sum_dem=sum(Denominator)) %>% mutate(ratio=sum_num/sum_dem)

A tibble: 2 x 4
Factor_1 sum_num sum_dem ratio
   <dbl>   <dbl>   <dbl> <dbl>
       1       3       7 0.429
       2       1       5 0.2

Does this help?

OTStats · Accepted Answer · 2018-11-29T18:26:00.630

Instead of thinking of this problem like a pivot in Excel, start thinking about this as a prime opportunity to use the tidyverse!

Let's set up our environment:

library(tidyverse)   # This will load dplyr and tidyverse to make visualization easier!

df <- data.frame("Factor_1" = c(1,2,1,1,2,1,1,2,1,2,1,2),   
             "Factor_2" = c("M", "F", "M", "F","M", "F","M", "F","M", "F","M", "F"),   
             "Denominator" = c(1,1,1,1,1,1,1,1,1,1,1,1),   
             "Numerator" = c(0,0,1,0,0,0,1,0,0,0,1,1))

First let's work with Factor_1. First, we want the Numerator and Denominator sums and Numerator/Denominator ratio for each Factor_1 group. We need to tell R that we want to group by Factor_1. Then we can use the summarize() function in dplyr package to do the majority of our heavy lifting.

summaryFactor1 <- df %>%                     # Save as new object: summaryFactor1
group_by(Factor_1) %>%                       # Group data by Factor_1, and for each:
summarize(sum_num = sum(Numerator),          # sum Numerator
        sum_den = sum(Denominator)) %>%      # sum Denominator
mutate(ratio = sum_num/sum_den)              # and create new column for num/den

This will give us:

summaryFactor1
#  A tibble: 2 x 4
  Factor_1 sum_num sum_den ratio
     <dbl>   <dbl>   <dbl> <dbl>
1        1       3       7 0.429
2        2       1       5 0.2

And to reproduce the graph you're looking for, we take our summaryFactor1 tibble and use ggplot:

summaryFactor1 %>%                        # Use our summary table
ggplot(aes(x = Factor_1)) +               # plot Factor_1 on x-axis, 
geom_col(aes(y = sum_den)) +              # sum_den as columns, 
geom_line(aes(y = ratio))                 # and ratio as a line

Notice there is only one y-axis, and thus the line plotting the ratio is difficult to interpret. While your desired plot that you shared from Excel looks better, be wary of the misinterpretation of that ratio.

We can use the same logic as above to Factor_2:

summaryFactor2 <- df %>%                     # Save as new object: summaryFactor1                    
group_by(Factor_2) %>%                       # Group data by Factor2, and for each:    
summarize(sum_num = sum(Numerator),          # sum Numerator
        sum_den = sum(Denominator)) %>%      # sum Denominator
mutate(ratio = sum_num/sum_den)              # and create new column for num/den

# Let's view the result
summaryFactor2
# A tibble: 2 x 4
 Factor_2 sum_num sum_den ratio
  <fct>      <dbl>   <dbl> <dbl>
1 F              1       6 0.167
2 M              3       6 0.5

Before we move on, notice that the sum of Denominator for each group is the same. When we compared ratio's within Factor_1's groups, note that the two groups have a different Denominator sums, so this is an easier 1:1 comparison.

Since plotting sum_den between the two groups won't be very insightful...

summaryFactor2
ggplot(aes(x = Factor_2)) + 
geom_col(aes(y = sum_den))

let's instead plot the ratio:

summaryFactor2 %>% 
ggplot(aes(x = Factor_2)) + 
geom_col(aes(y = ratio))

Thank you very much for taking the time to do this. I can get it to work, so this is progress. The main missing part is having a secondary y axis for the "ratio" item, which is quite important. I usually manually rescale the axes (if I'm working in Excel, for example), so that the columns of the "denominator" series are a lot smaller (ie underneath) the line of the "ratio" series - this is a convenient way to reduce the risk of being confused between the two. I may well be able to work this out for myself... — Alan, Nov 29 '18 at 19:09

Summarising Data in R

2 Answers2