
I have a data frame in which I want to calculate a count column and a proportion (or percentage) column, grouped by three factors. In this example the factors are state, gender, and age.

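# Simulated data: 40 students, 20 per state
state <- rep(c(rep("Idaho", 10), rep("Maine", 10)), 2)
# 8 random IDs, recycled by data.frame() to fill all 40 rows
student.id <- sample(1:1000, 8, replace = TRUE)
# Draw 40 genders from a pool that is 25% male, 75% female
gender <- rep(c("Male", "Female"), 100 * c(0.25, 0.75))
gender <- sample(gender, 40)
# Draw 40 school levels from a 50/50 pool
age <- rep(c("Primary school", "Secondary school"), 100 * c(0.5, 0.5))
age <- sample(age, 40)
school.data <- data.frame(student.id, state, gender, age)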

For only two factors, there is a very good solution in "dplyr to create aggregate percentages of factor levels".

But when I use that code with more than two factors, it gives incorrect values in the proportion column. Does anyone know how to calculate proportions across at least three factors?

The code I tried was:

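library(dplyr)

# Attempt: count each state/gender/age combination,
# then compute proportions within each state only
proportions <- group_by(school.data, state, gender, age) %>%
  summarize(n = length(student.id)) %>%
  ungroup() %>%
  group_by(state) %>%
  mutate(proportion = n / sum(n))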

In the proportions data frame, I want each proportion to compare the levels of one factor while the other two factors are held constant, for example Idaho female primary school vs. Idaho female secondary school. I would like those numbers calculated across the whole data frame, but the proportions this code generates do not match.

I want the data in this format so that I can create a stacked bar plot in ggplot, with the option of printing the counts or percentages on the bars, as in "Showing data values on stacked bar chart in ggplot2".
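As a rough sketch of the target plot, assume a summary table already exists with the columns state, gender, age, n, and proportion. The name `plot.data` and every count in it are made up purely for illustration; they are not taken from `school.data`.

```r
library(ggplot2)

# Hypothetical summary table in the desired format (all values are made up)
plot.data <- data.frame(
  state      = rep(c("Idaho", "Maine"), each = 4),
  gender     = rep(rep(c("Female", "Male"), each = 2), times = 2),
  age        = rep(c("Primary school", "Secondary school"), times = 4),
  n          = c(3, 6, 2, 4, 5, 5, 1, 3),
  # proportion of each age level within its state/gender combination
  proportion = c(3/9, 6/9, 2/6, 4/6, 5/10, 5/10, 1/4, 3/4)
)

# Stacked bars of counts, one panel per state, labelled with count and percentage
p <- ggplot(plot.data, aes(x = gender, y = n, fill = age)) +
  geom_col() +
  geom_text(
    aes(label = paste0(n, " (", round(100 * proportion), "%)")),
    position = position_stack(vjust = 0.5)  # centre each label in its segment
  ) +
  facet_wrap(~ state)

print(p)
```

Mapping `y = proportion` instead of `y = n` would give bars that each stack to 1.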

Alicia
  • I think it would be clearer if you created a *small* example (say, 10-20 rows) so that the right answer is apparent and it is easy to compare to---with an example that small you could post the right answer in the question. Also, it's nice if you use `set.seed()` before random simulation so everyone can simulate the same data. – Gregor Thomas Jan 23 '20 at 17:54

1 Answer


I think you are just missing "gender" in the second group_by. The following code seems to produce the right proportions.

library(tidyverse)

# Same simulated data as in the question
state <- rep(c(rep("Idaho", 10), rep("Maine", 10)), 2)
student.id <- sample(1:1000, 8, replace = TRUE)
gender <- rep(c("Male", "Female"), 100 * c(0.25, 0.75))
gender <- sample(gender, 40)
age <- rep(c("Primary school", "Secondary school"), 100 * c(0.5, 0.5))
age <- sample(age, 40)
school.data <- data.frame(student.id, state, gender, age)

# Count each state/gender/age combination, then divide by the
# total for that state/gender pair so age proportions sum to 1
proportions <- group_by(school.data, state, gender, age) %>%
  summarize(n = n()) %>%
  ungroup() %>%
  group_by(state, gender) %>%
  mutate(proportion = n / sum(n))

gives

state   gender  age               n   proportion
-------------------------------------------------
Idaho   Female  Primary school    4   0.3076923
Idaho   Female  Secondary school  9   0.6923077
Idaho   Male    Primary school    2   0.2857143
Idaho   Male    Secondary school  5   0.7142857
Maine   Female  Primary school    8   0.5714286
Maine   Female  Secondary school  6   0.4285714
Maine   Male    Primary school    4   0.6666667
Maine   Male    Secondary school  2   0.3333333
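
Because the simulated data is random, a small fixed example may make the result easier to check by hand. The sketch below is only an illustration: the data frame `toy` and its 8 rows are hypothetical, and it uses dplyr's `count()` as a shorthand for the group_by/summarize step above.

```r
library(dplyr)

# Hypothetical toy data: 8 students, small enough to verify by eye
toy <- data.frame(
  state  = c("Idaho", "Idaho", "Idaho", "Idaho",
             "Maine", "Maine", "Maine", "Maine"),
  gender = c("Female", "Female", "Female", "Male",
             "Female", "Male", "Male", "Male"),
  age    = c("Primary school", "Secondary school", "Secondary school",
             "Primary school", "Primary school", "Primary school",
             "Primary school", "Secondary school")
)

toy_props <- toy %>%
  count(state, gender, age) %>%        # one row per combination, count in n
  group_by(state, gender) %>%          # hold state and gender constant
  mutate(proportion = n / sum(n)) %>%  # share of each age level in the group
  ungroup()

# Sanity check: proportions within each state/gender pair should sum to 1
check <- toy_props %>%
  group_by(state, gender) %>%
  summarize(total = sum(proportion), .groups = "drop")

print(toy_props)
print(check)
```

Here Idaho females split 1 primary to 2 secondary, so their proportions should come out as 1/3 and 2/3. Grouping by state alone in the second step would instead divide by all 4 Idaho students.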
Seshadri