I am trying to work with smaller samples because brms and rstan models would otherwise take forever to run on my complete datasets. To do that, I need to reduce my sample size by drawing X random schools from each country. It is a nested dataset: students are in classes, nested in schools, nested in countries.
I cannot simply draw schools at random from the whole dataset, since this would risk leaving some countries out. And because schools are labeled similarly across countries, sampling on the school label alone would mix up schools from different countries.
Here is how it looks in a simulated form. Note that in the real data the schools are not all the same size, as they are here.
mydata <- data.frame(country =c(rep("Germany", 20), rep("Italy", 20),rep("France", 20)),
school =c(rep("A", 5), rep("B", 5), rep("C", 5), rep("D", 5)),
student.age = sample(18:30, 60, replace=TRUE),
var1 = rnorm(60,0,1))
I found a function online that does the job, but only for the first grouping factor, i.e., countries: it randomly selects x countries along with all their rows.
library(dplyr)

sample_n_groups = function(tbl, size, replace = FALSE, weight = NULL) {
  # remember the grouping variables so the result can be regrouped when done
  grps = group_vars(tbl)
  # collapse to one row per group, then sample `size` of those groups
  keep = tbl %>% summarise(.groups = "drop") %>% sample_n(size, replace, weight)
  # keep only the rows of the selected groups; regroup because the join
  # may not preserve the original grouping
  tbl %>% right_join(keep, by = grps) %>% group_by(across(all_of(grps)))
}
But applying it to a second level of grouping (to take random samples of schools within countries) fails:
mydata %>%
group_by(country, school) %>%
sample_n_groups(2)
So my question is: is there a tidyverse way (or any other way in R) of doing this? To be more explicit, I need, for example, 2 random schools from each country (Italy, Germany, and so on), keeping all the rows of the selected schools. The schools vary in size and are almost always coded the same way across countries.
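To make the desired result concrete, here is a minimal sketch of one possible approach, not necessarily the best one. It assumes dplyr 1.0.0 or later (for `slice_sample()`); the names `n_schools`, `sampled_schools`, and `mysample` are made up for illustration, and the seed is fixed only so the sketch is reproducible. The idea is to sample (country, school) pairs within each country first, and only then pull the matching student rows.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed only for reproducibility of the sketch

# Simulated data in the same shape as the example in the question
mydata <- data.frame(
  country = rep(c("Germany", "Italy", "France"), each = 20),
  school = rep(rep(c("A", "B", "C", "D"), each = 5), times = 3),
  student.age = sample(18:30, 60, replace = TRUE),
  var1 = rnorm(60, 0, 1)
)

n_schools <- 2  # hypothetical name: number of schools to keep per country

# Step 1: reduce to one row per (country, school) pair.
# Step 2: sample n_schools of those pairs within each country.
sampled_schools <- mydata %>%
  distinct(country, school) %>%
  group_by(country) %>%
  slice_sample(n = n_schools) %>%
  ungroup()

# Step 3: keep every student row that belongs to a sampled school.
# Joining on both columns keeps school "A" in Germany distinct from
# school "A" in Italy.
mysample <- mydata %>%
  semi_join(sampled_schools, by = c("country", "school"))
```

Because the sampling is done on the school list rather than on the student rows, unequal school sizes should not affect which schools are picked, and every country contributes the same number of schools.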