
Given a dataframe df with a column called group, how do you randomly sample k groups from it in dplyr? It should return all rows from k groups (given there are at least k unique values in df$group), and every group in df should be equally likely to be returned.

Big Dogg

5 Answers


Just use sample() to choose k of the group values, then keep the rows whose group is among them:

library(dplyr)
iris %>% filter(Species %in% sample(levels(Species), 2))
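
The one-liner above can be wrapped into a small helper. This is only a sketch: the name `sample_k_groups` and its arguments are made up for illustration and are not part of dplyr. It assumes a dplyr version that supports `{{ }}`, uses unique() so that character and numeric group columns also work, and stops if fewer than k groups exist.

```r
library(dplyr)

# Hypothetical helper (not a dplyr function): keep all rows of k random groups.
sample_k_groups <- function(df, group_col, k) {
  # unique() works for factor, character, and numeric columns alike
  groups <- unique(pull(df, {{ group_col }}))
  stopifnot(k <= length(groups))
  # sample positions rather than values, so a single numeric group id
  # is never interpreted by sample() as the range 1:n
  chosen <- groups[sample.int(length(groups), k)]
  filter(df, {{ group_col }} %in% chosen)
}

set.seed(42)
picked <- sample_k_groups(iris, Species, 2)
```

Because sample.int() draws positions uniformly without replacement, every group is equally likely to be selected.
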
MrFlick
  • You can use 'unique(Species)' instead of 'levels(Species)' if you have a character or numeric identifier column. – Dan Slone Apr 23 '21 at 04:52
  • That is a nice solution, but I just ran into a problem here: I wish to sample with replacement, which is not compatible with the %in% statement. – Daniel Münch May 19 '21 at 12:57
  • Very elegant answer. If you are looking for the reverse (n entries per id), have a look here: https://stackoverflow.com/questions/18258690/take-random-sample-by-group – Samuel Saari Aug 19 '22 at 05:23
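
On the sampling-with-replacement point raised in the comments: %in% only tests membership, so a group drawn twice would still appear once. One possible workaround, sketched here under the assumption that repeating a group's rows once per draw is what is wanted, is to filter once per drawn label and stack the results. The `draw_id` column is a hypothetical addition to tell repeated draws apart.

```r
library(dplyr)

set.seed(1)
groups <- as.character(unique(iris$Species))
# draw 4 group labels with replacement; the same label may come up repeatedly
chosen <- sample(groups, 4, replace = TRUE)

# filter once per draw so a group drawn twice contributes its rows twice;
# draw_id is a hypothetical column that identifies each draw
boot <- bind_rows(lapply(seq_along(chosen), function(i) {
  iris %>%
    filter(Species == chosen[i]) %>%
    mutate(draw_id = i)
}))
```
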

I think this approach makes the most sense if you are using dplyr:

library(dplyr)
library(tidyr)

iris_grouped <- iris %>% 
  group_by(Species) %>% 
  nest()

Which produces:

# A tibble: 3 x 2
  Species    data             
  <fct>      <list>           
1 setosa     <tibble [50 × 4]>
2 versicolor <tibble [50 × 4]>
3 virginica  <tibble [50 × 4]>

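You can then sample whole groups with sample_n, since each row now holds one group (if the nested tibble is still grouped, as reported in the comments for newer tidyr versions, call ungroup() first):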

iris_grouped %>%
  sample_n(2)

# A tibble: 2 x 2
  Species    data             
  <fct>      <list>           
1 virginica  <tibble [50 × 4]>
2 versicolor <tibble [50 × 4]>
Oscar
  • That's great. Don't forget to `unnest()` at the end for further calculations. – Marco Dec 17 '19 at 13:28
  • I really prefer this syntax, but on my large dataset this method took hours to run while @MrFlick's answer only took a second. – chakuRak Feb 07 '20 at 20:43
  • This doesn't work for me on tidyverse 1.3.0, but `iris_grouped <- iris %>% nest(-Species) %>% slice_sample(n=2) %>% unnest` does. – Adam Lee Perelman Sep 09 '21 at 11:30
  • Using `tidyr 1.2.0` and `dplyr 1.0.9`, neither the answer nor the suggestion by @AdamLeePerelman works. I found that `iris %>% group_by(Species) %>% nest() %>% ungroup() %>% slice_sample(n=2)` worked. – Cole Robertson Aug 01 '22 at 11:57
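
Putting the comments above together, a complete pipeline might look like the following. It is a sketch that assumes dplyr 1.0 or later (for slice_sample()) and tidyr 1.0 or later (where nest() on a grouped data frame creates a list column named `data`); it has not been checked against every version mentioned in the comments.

```r
library(dplyr)
library(tidyr)

set.seed(123)
sampled <- iris %>%
  group_by(Species) %>%
  nest() %>%                 # one row per group, remaining columns in `data`
  ungroup() %>%              # drop grouping so whole rows (= groups) are sampled
  slice_sample(n = 2) %>%    # pick 2 groups, each equally likely
  unnest(cols = data)        # expand back to the original rows
```
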

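Note that on a data frame as small as iris, the dplyr version carries considerably more overhead than the equivalent base R subsetting (roughly 7x at the median in the benchmark below):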

library(dplyr)
library(microbenchmark)
microbenchmark(dplyr = iris %>% filter(Species %in% sample(levels(Species), 2)),
               base = iris[iris[["Species"]] %in% sample(levels(iris[["Species"]]), 2), ])

Unit: microseconds
  expr     min      lq     mean  median       uq      max neval cld
 dplyr 660.287 710.655 753.6704 722.629 771.2860 1122.527   100   b
  base  83.629  95.032 110.0936 106.057 119.1715  199.949   100  a 

Note that [[ is known to be faster than $, although both work.
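
The base R expression from the benchmark can also be written as a reusable function. This is a sketch only: `sample_k_groups_base` is a hypothetical name, and the group column is passed as a string so it can be used with [[.

```r
# Hypothetical base R helper; the name and arguments are illustrative.
sample_k_groups_base <- function(df, group_col, k) {
  groups <- unique(df[[group_col]])   # group_col is a column name given as a string
  stopifnot(k <= length(groups))
  # sample positions so that every group is equally likely
  chosen <- groups[sample.int(length(groups), k)]
  df[df[[group_col]] %in% chosen, , drop = FALSE]
}

set.seed(7)
res <- sample_k_groups_base(iris, "Species", 2)
```
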

Christopher Oezbek
alexwhitworth

I really like the approach described by Tristan Mahr here. I've copied his function from the blog for the example below:

library(tidyverse)

sample_n_of <- function(data, size, ...) {
  dots <- quos(...)
  
  group_ids <- data %>% 
    group_by(!!! dots) %>% 
    group_indices()
  
  sampled_groups <- sample(unique(group_ids), size)
  
  data %>% 
    filter(group_ids %in% sampled_groups)
}

set.seed(1234)
mpg %>% 
  sample_n_of(size = 2, model)
#> # A tibble: 12 x 11
#>    manufacturer model   displ  year   cyl trans   drv     cty   hwy fl    class 
#>    <chr>        <chr>   <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr> 
#>  1 audi         a6 qua~   2.8  1999     6 auto(l~ 4        15    24 p     midsi~
#>  2 audi         a6 qua~   3.1  2008     6 auto(s~ 4        17    25 p     midsi~
#>  3 audi         a6 qua~   4.2  2008     8 auto(s~ 4        16    23 p     midsi~
#>  4 ford         mustang   3.8  1999     6 manual~ r        18    26 r     subco~
#>  5 ford         mustang   3.8  1999     6 auto(l~ r        18    25 r     subco~
#>  6 ford         mustang   4    2008     6 manual~ r        17    26 r     subco~
#>  7 ford         mustang   4    2008     6 auto(l~ r        16    24 r     subco~
#>  8 ford         mustang   4.6  1999     8 auto(l~ r        15    21 r     subco~
#>  9 ford         mustang   4.6  1999     8 manual~ r        15    22 r     subco~
#> 10 ford         mustang   4.6  2008     8 manual~ r        15    23 r     subco~
#> 11 ford         mustang   4.6  2008     8 auto(l~ r        15    22 r     subco~
#> 12 ford         mustang   5.4  2008     8 manual~ r        14    20 p     subco~

Created on 2021-03-24 by the reprex package (v0.3.0)
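
As a variation on the function above, the group ids can also be computed with cur_group_id(), passing `...` straight to group_by(). This is a sketch that assumes dplyr 1.0 or later; `sample_n_of2` and the `.group_id` column are hypothetical names, and mtcars is used here so that no extra package is needed for the data.

```r
library(dplyr)

# Hypothetical variant of sample_n_of(); assumes dplyr >= 1.0 for cur_group_id()
sample_n_of2 <- function(data, size, ...) {
  # label every row with the integer id of its group (1, 2, ..., number of groups)
  ids <- data %>%
    group_by(...) %>%
    mutate(.group_id = cur_group_id()) %>%
    pull(.group_id)

  keep <- sample(unique(ids), size)
  # keep all rows whose group was drawn; row order is unchanged
  data[ids %in% keep, , drop = FALSE]
}

set.seed(1234)
two_combos <- sample_n_of2(mtcars, size = 2, cyl, gear)
```

Grouping by several columns, as in this call, samples combinations of cyl and gear rather than values of a single column.
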

Bryan Shalloway

I too had issues with Oscar's code using nest. But when I updated to the latest syntax of nest(), unnest(), and slice_sample() it worked.

Below is an alternate version that will produce the same answers if the input frame is arranged by the group variable. Otherwise the answers will be just as good on average. This version has a couple of advantages over the nest version: (1) the final data frame has its columns in the original order, whereas the nest version puts the grouping variable first; (2) the intermediate results are a lot easier to read when you are debugging, since they are plain old lists.

I am interested in sampling the original number of groups with replacement, as in clustered bootstrapping. One could easily add more parameters to make the function more general.

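library(dplyr)

# function to compute a clustered bootstrap sample
samplebygroups <- function(df, groupvar) {
  # split the data into a list with one data frame per group
  datalist <- df %>%
    group_by({{ groupvar }}) %>%
    group_split()
  n <- length(datalist)
  # draw n group positions with replacement
  samplegroups <- sample(n, replace = TRUE)
  # stack the drawn groups; a group drawn twice appears twice
  datalist[samplegroups] %>%
    bind_rows()
}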

Here is a sample run:

library(tibble)  # for rownames_to_column()

smallcars <- mtcars %>%  
  rownames_to_column(var = "Model") %>% 
  tail(5) %>%
  arrange(cyl) %>%
  select(Model, cyl, mpg)

set.seed(1000)
samplebygroups(smallcars, cyl)

with output:

# A tibble: 5 x 3
  Model            cyl   mpg
  <chr>          <dbl> <dbl>
1 Ford Pantera L     8  15.8
2 Maserati Bora      8  15  
3 Ferrari Dino       6  19.7
4 Ford Pantera L     8  15.8
5 Maserati Bora      8  15  

You would get exactly the same rows using Oscar's code, but cyl would be the first column.
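
As mentioned above, the function can be made more general. One possible sketch adds a number of groups and a replacement switch; the name `sample_groups` and the parameters `k` and `replace` are hypothetical and chosen only for illustration. By default it draws as many groups as exist, which with replace = TRUE matches the clustered bootstrap above.

```r
library(dplyr)

# Hypothetical generalisation of samplebygroups(); `k` and `replace` are
# illustrative parameter names.
sample_groups <- function(df, groupvar, k = NULL, replace = FALSE) {
  datalist <- df %>%
    group_by({{ groupvar }}) %>%
    group_split()
  n <- length(datalist)
  if (is.null(k)) k <- n   # default: draw as many groups as exist
  # sample.int() errors if k > n while replace = FALSE
  idx <- sample.int(n, size = k, replace = replace)
  bind_rows(datalist[idx])
}

set.seed(1000)
two_groups <- sample_groups(mtcars, cyl, k = 2)       # 2 distinct groups
boot <- sample_groups(mtcars, cyl, replace = TRUE)    # clustered bootstrap
```
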