Is there a way to draw random subgroups from all groups in a nested dataset in R?

Question

I am trying to work with smaller samples because brms and rstan models would otherwise take forever to run on my complete datasets. To do that, I need to reduce my sample size by drawing X random schools from each of the countries. It is a nested dataset: students are in classes, nested in schools, nested in countries.

I cannot simply draw schools at random from the whole dataset, since this would risk not including all the countries. And since schools are labeled similarly across countries, I also risk confusing schools that share a label.

Here is how it looks in simulated form. Note that, unlike in this example, the real schools are not equal in size.

mydata <- data.frame(country =c(rep("Germany", 20), rep("Italy", 20),rep("France", 20)),
                 school  =c(rep("A", 5), rep("B", 5), rep("C", 5), rep("D", 5)),
                 student.age     = sample(18:30, 60, replace=TRUE),
                 var1            = rnorm(60,0,1))

There is a function I found online, here, that does the job, but only for the first grouping factor, i.e., countries: it randomly selects x countries with all their rows.

sample_n_groups = function(tbl, size, replace = FALSE, weight = NULL) {
  # regroup when done
  grps = tbl %>% groups %>% lapply(as.character) %>% unlist
  # check length of groups non-zero
  keep = tbl %>% summarise() %>% ungroup() %>% sample_n(size, replace, weight)
  # keep only selected groups, regroup because joins change count.
  # regrouping may be unnecessary but joins do something funky to grouping variable
  tbl %>% right_join(keep, by=grps) %>% group_by_(.dots = grps)
}

But applying it at a second level of grouping (to take random samples of schools within countries) fails.

mydata %>% 
  group_by(country, school) %>% 
  sample_n_groups(2)

So my question is: is there a tidyverse way (or any other way in R) of doing this? To be more specific, I need, for example, 2 random schools from each country (Italy, Germany and so on). The schools vary in size and are almost always coded the same across countries.

Does this answer your question? [Randomly sample groups](https://stackoverflow.com/questions/37149649/randomly-sample-groups) — DPH, Dec 24 '21 at 15:52
No, since the answers have mainly two problems: they either do not return full groups (they sample within the group), or they ignore the nested nature of the data. In the case I presented, I need to sample full groups that are within categories. Thank you, nonetheless. — George GL, Dec 25 '21 at 10:51
@GeorgeGL You can try this: `mydata %>% group_by(country, school) %>% slice_sample() %>% group_by(country) %>% slice_sample(n = 2)` — jpdugo17, Dec 26 '21 at 21:21
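As written, the pipeline in the last comment keeps a single row per sampled school, because `slice_sample()` defaults to `n = 1`. A minimal sketch of one way to keep whole schools instead, assuming the simulated `mydata` from the question: sample from the distinct (country, school) pairs within each country, then use a filtering join to pull back every row of the chosen schools. The names `sampled_schools` and `subsample` are illustrative, and `set.seed()` is only there to make the example repeatable.

```r
library(dplyr)

set.seed(1)  # illustration only: makes the draw repeatable

# Simulated data from the question
mydata <- data.frame(
  country     = c(rep("Germany", 20), rep("Italy", 20), rep("France", 20)),
  school      = c(rep("A", 5), rep("B", 5), rep("C", 5), rep("D", 5)),
  student.age = sample(18:30, 60, replace = TRUE),
  var1        = rnorm(60, 0, 1)
)

# Reduce to one row per (country, school) pair, then draw
# 2 schools within each country
sampled_schools <- mydata %>%
  distinct(country, school) %>%
  group_by(country) %>%
  slice_sample(n = 2) %>%
  ungroup()

# Keep every row belonging to the selected schools
subsample <- mydata %>%
  semi_join(sampled_schools, by = c("country", "school"))
```

Joining on both `country` and `school` is what keeps school "A" in Germany separate from school "A" in Italy. If a country has fewer schools than requested, `slice_sample()` returns all of them rather than raising an error.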

Yuriy Saraykin · Answer 1 · 2021-12-24T16:00:09.507


library(tidyverse)
df %>% 
  group_by(Country) %>% 
  slice_sample(n = 3)
#> # A tibble: 6 x 4
#> # Groups:   Country [2]
#>   Country School Class Student
#>   <chr>    <int> <int>   <int>
#> 1 Germany      1     2       1
#> 2 Germany      1     1       1
#> 3 Germany      1     1       2
#> 4 Italy        1     2       1
#> 5 Italy        1     2       3
#> 6 Italy        2     2       1

# or

library(sampling)
sample_strata <- strata(
  data = df,
  stratanames = c("Country"),
  size = c(2, 3),
  method = "srswor"
)

sample_strata
#>    Country ID_unit      Prob Stratum
#> 3    Italy       3 0.2857143       1
#> 4    Italy       4 0.2857143       1
#> 9  Germany       9 0.6000000       2
#> 11 Germany      11 0.6000000       2
#> 12 Germany      12 0.6000000       2

df[sample_strata$ID_unit, ]
#>    Country School Class Student
#> 3    Italy      1     2       1
#> 4    Italy      1     2       2
#> 9  Germany      1     1       2
#> 11 Germany      2     1       1
#> 12 Germany      2     2       1

Created on 2021-12-24 by the reprex package (v2.0.1)

data

df <- structure(
  list(
    Country = c(
      "Italy",
      "Italy",
      "Italy",
      "Italy",
      "Italy",
      "Italy",
      "Italy",
      "Germany",
      "Germany",
      "Germany",
      "Germany",
      "Germany"
    ),
    School = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L),
    Class = c(1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L),
    Student = c(1L, 2L, 1L, 2L, 3L, 1L, 1L, 1L, 2L, 1L, 1L, 1L)
  ),
  class = "data.frame",
  row.names = c(NA,-12L)
)

slice_sample() does not work, as it does not select whole schools... — George GL, Dec 24 '21 at 16:34

score 0 · Answer 2 · answered Dec 24 '21 at 16:09

Another method using data.table

library(data.table)
setDT(df)

df[df[ , .I[sample(.N, 3)] , by = Country]$V1]
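The line above samples individual rows within each country. If whole schools are needed, as the question asks, a possible data.table sketch is to sample from the unique (Country, School) pairs and join back. It assumes the toy data from the first answer, rebuilt here as `dt` so the block runs on its own; `n_schools`, `schools`, `picked` and `subsample` are hypothetical names introduced for illustration.

```r
library(data.table)

# Toy data with the same values as `df` in the first answer
dt <- data.table(
  Country = c(rep("Italy", 7), rep("Germany", 5)),
  School  = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L),
  Class   = c(1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L),
  Student = c(1L, 2L, 1L, 2L, 3L, 1L, 1L, 1L, 2L, 1L, 1L, 1L)
)

n_schools <- 1  # hypothetical: number of schools to keep per country

# One row per (Country, School) pair
schools <- unique(dt[, .(Country, School)])

# Sample pairs within each country; min() guards against countries
# that have fewer schools than requested
picked <- schools[, .SD[sample(.N, min(.N, n_schools))], by = Country]

# Join back to retrieve all rows of the picked schools
subsample <- dt[picked, on = .(Country, School)]
```

Because the sampling happens on the school-level table rather than on student rows, schools of different sizes have the same chance of being drawn within a country.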

Merijn van Tilborg
