I have a data frame for which I want to calculate a count column and a proportion (or percentage) column, grouped by three different factors. In this example the factors are state, gender, and age.
set.seed(1)  # make the random example reproducible
state <- rep(c(rep("Idaho", 10), rep("Maine", 10)), 2)
student.id <- sample(1:1000, 40, replace = TRUE)  # one id per row (40 rows), not 8 recycled ids
gender <- rep(c("Male", "Female"), 100 * c(0.25, 0.75))
gender <- sample(gender, 40)
age <- rep(c("Primary school", "Secondary school"), 100 * c(0.5, 0.5))
age <- sample(age, 40)
school.data <- data.frame(student.id, state, gender, age)
For calculating this with only two factors, a very good solution is given in "dplyr to create aggregate percentages of factor levels".
But when I use that code with more than two factors, it gives incorrect values in the proportion column. Does anyone know how to find proportions across at least three factors?
The code I tried was:
library(dplyr)
proportions <- group_by(school.data, state, gender, age) %>%
  summarize(n = length(student.id)) %>%
  ungroup() %>%
  group_by(state) %>%
  mutate(proportion = n / sum(n))
In the proportions data frame I want each proportion to compare the levels of one factor while the other two are held constant, for example Idaho female primary school vs. Idaho female secondary school. I would like to calculate those numbers for every combination in the data frame, but the proportions the code generates do not match this.
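To make the intended calculation concrete, here is a minimal sketch on a small hand-made data frame. The `toy` data and its values are hypothetical (not `school.data`), chosen so the result can be checked by hand. It assumes the denominator should be the total within each state and gender pair, which means grouping by both of those columns before dividing. The attempted code groups by `state` alone before the `mutate()`, so it divides by the state total instead.

```r
library(dplyr)

# Hypothetical toy data, small enough to verify the proportions by hand
toy <- data.frame(
  state  = c("Idaho", "Idaho", "Idaho", "Idaho", "Maine", "Maine"),
  gender = c("Female", "Female", "Female", "Male", "Female", "Female"),
  age    = c("Primary school", "Primary school", "Secondary school",
             "Primary school", "Primary school", "Secondary school")
)

proportions <- toy %>%
  count(state, gender, age) %>%          # n = rows per state/gender/age combination
  group_by(state, gender) %>%            # hold state and gender constant
  mutate(proportion = n / sum(n)) %>%    # share of each age level within that pair
  ungroup()

print(proportions)
```

With this grouping, the proportions within every state and gender pair sum to 1. For Idaho females in the toy data, primary school is 2 of 3 rows and secondary school is 1 of 3.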
I want the data in this format so that I can create a stacked bar plot in ggplot, with the option of printing either the counts or the percentages on the bars, as in "Showing data values on stacked bar chart in ggplot2".
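The kind of plot I have in mind could be sketched roughly as follows. This is only an illustration: the `toy` data is hypothetical, and the layout (gender on the x axis, one facet per state) is an assumption rather than a requirement.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for school.data
toy <- data.frame(
  state  = c("Idaho", "Idaho", "Idaho", "Idaho", "Maine", "Maine"),
  gender = c("Female", "Female", "Female", "Male", "Female", "Female"),
  age    = c("Primary school", "Primary school", "Secondary school",
             "Primary school", "Primary school", "Secondary school")
)

# Counts and within-(state, gender) proportions, one row per combination
plot_data <- toy %>%
  count(state, gender, age) %>%
  group_by(state, gender) %>%
  mutate(proportion = n / sum(n)) %>%
  ungroup()

# Stacked bars of counts, with each segment labelled at its midpoint
p <- ggplot(plot_data, aes(x = gender, y = n, fill = age)) +
  geom_col() +
  geom_text(aes(label = n), position = position_stack(vjust = 0.5)) +
  facet_wrap(~ state)

print(p)
```

To print percentages instead of counts, the label could be swapped for something like `paste0(round(100 * proportion), "%")`.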