I am trying to work with smaller samples because brms and rstan models would otherwise take forever to run on my complete datasets. To do that, I need to reduce my sample size by drawing X random schools from each of the countries. The dataset is nested: students are in classes, nested in schools, nested in countries.

I cannot simply draw schools at random from the whole dataset, since this would risk not including all the countries. And since schools are labeled similarly across countries, I would also risk confusing schools from different countries that share a label.

Here is how it looks with simulated data. Note that in the real data the schools are not all the same size, as they are here.

mydata <- data.frame(country =c(rep("Germany", 20), rep("Italy", 20),rep("France", 20)),
                 school  =c(rep("A", 5), rep("B", 5), rep("C", 5), rep("D", 5)),
                 student.age     = sample(18:30, 60, replace=TRUE),
                 var1            = rnorm(60,0,1))

There is a function I found online, here, that does the job, but only for the first grouping factor, i.e., countries: it randomly selects x countries with all their rows.

sample_n_groups = function(tbl, size, replace = FALSE, weight = NULL) {
  # regroup when done
  grps = tbl %>% groups %>% lapply(as.character) %>% unlist
  # check length of groups non-zero
  keep = tbl %>% summarise() %>% ungroup() %>% sample_n(size, replace, weight)
  # keep only selected groups, regroup because joins change count.
  # regrouping may be unnecessary but joins do something funky to grouping variable
  tbl %>% right_join(keep, by=grps) %>% group_by_(.dots = grps)
}

But applying it to a second level of grouping (to take random samples of groups within countries) fails.

mydata %>% 
  group_by(country, school) %>% 
  sample_n_groups(2)

So my question is: is there a tidyverse way (or any other way in R) of doing this? To be more explicit, I need, for example, 2 random schools from each country (Italy, Germany, and so on). The schools vary in size and are almost always coded the same way across countries.

George GL
  • Does this answer your question? [Randomly sample groups](https://stackoverflow.com/questions/37149649/randomly-sample-groups) – DPH Dec 24 '21 at 15:52
  • No, since the answers there have two main problems: they either do not return full groups (they sample within the group), or they ignore the nested nature of the data. In the case I presented, I need to sample full groups that are within categories. Thank you, nonetheless. – George GL Dec 25 '21 at 10:51
  • @GeorgeGL You can try this: `mydata %>% group_by(country, school) %>% slice_sample() %>% group_by(country) %>% slice_sample(n = 2)` – jpdugo17 Dec 26 '21 at 21:21

2 Answers

library(tidyverse)
df %>% 
  group_by(Country) %>% 
  slice_sample(n = 3)
#> # A tibble: 6 x 4
#> # Groups:   Country [2]
#>   Country School Class Student
#>   <chr>    <int> <int>   <int>
#> 1 Germany      1     2       1
#> 2 Germany      1     1       1
#> 3 Germany      1     1       2
#> 4 Italy        1     2       1
#> 5 Italy        1     2       3
#> 6 Italy        2     2       1

# or

library(sampling)
sample_strata <- strata(
  data = df,
  stratanames = c("Country"),
  size = c(2, 3),
  method = "srswor"
)

sample_strata
#>    Country ID_unit      Prob Stratum
#> 3    Italy       3 0.2857143       1
#> 4    Italy       4 0.2857143       1
#> 9  Germany       9 0.6000000       2
#> 11 Germany      11 0.6000000       2
#> 12 Germany      12 0.6000000       2

df[sample_strata$ID_unit, ]
#>    Country School Class Student
#> 3    Italy      1     2       1
#> 4    Italy      1     2       2
#> 9  Germany      1     1       2
#> 11 Germany      2     1       1
#> 12 Germany      2     2       1

Created on 2021-12-24 by the reprex package (v2.0.1)

data

df <- structure(
  list(
    Country = c(
      "Italy",
      "Italy",
      "Italy",
      "Italy",
      "Italy",
      "Italy",
      "Italy",
      "Germany",
      "Germany",
      "Germany",
      "Germany",
      "Germany"
    ),
    School = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L),
    Class = c(1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L),
    Student = c(1L, 2L, 1L, 2L, 3L, 1L, 1L, 1L, 2L, 1L, 1L, 1L)
  ),
  class = "data.frame",
  row.names = c(NA,-12L)
)
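
Note that both calls above draw individual rows within each country, not whole schools. If every row of each selected school is needed, one possible approach is to sample the distinct country/school pairs first and then filter the full data against them. The following is a minimal sketch, not a definitive implementation: it assumes dplyr 1.0.0 or later (for `slice_sample()`), and the data frame `mydata` and the result names `picked` and `sampled` are illustrative, rebuilt here in the shape of the question's example.

```r
library(dplyr)

set.seed(1)  # only so the illustration is reproducible

# Simulated data in the shape of the question's example (assumed layout):
# 3 countries, 4 schools each, school labels reused across countries
mydata <- data.frame(
  country = rep(c("Germany", "Italy", "France"), each = 20),
  school  = rep(rep(c("A", "B", "C", "D"), each = 5), times = 3),
  var1    = rnorm(60)
)

# Step 1: one row per (country, school) pair, then draw 2 schools per country
picked <- mydata %>%
  distinct(country, school) %>%
  group_by(country) %>%
  slice_sample(n = 2) %>%
  ungroup()

# Step 2: keep every row belonging to a selected (country, school) pair
sampled <- mydata %>%
  semi_join(picked, by = c("country", "school"))
```

Matching on both `country` and `school` is what keeps identically labeled schools in different countries apart.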
Yuriy Saraykin

Another method, using data.table:

library(data.table)
setDT(df)

df[df[ , .I[sample(.N, 3)] , by = Country]$V1]
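
The line above likewise draws 3 rows per country. For whole schools, the same idea can be applied to the unique country/school pairs and then joined back to the full table. This is only a sketch under assumptions: the table `dt` and the names `picked` and `sampled_dt` are illustrative, and the data are simulated in the shape of the question's example.

```r
library(data.table)

set.seed(1)  # only so the illustration is reproducible

# Simulated data (assumed layout): school labels repeat across countries
dt <- data.table(
  country = rep(c("Germany", "Italy", "France"), each = 20),
  school  = rep(rep(c("A", "B", "C", "D"), each = 5), times = 3),
  var1    = rnorm(60)
)

# Unique (country, school) pairs, then up to 2 schools drawn within each country
picked <- unique(dt[, .(country, school)])[
  , .SD[sample(.N, min(.N, 2))], by = country
]

# Join back on both keys to keep all rows of the selected schools
sampled_dt <- dt[picked, on = .(country, school)]
```

The `min(.N, 2)` guard is there so a country with fewer than 2 schools does not raise an error in `sample()`.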
Merijn van Tilborg