Given a dataframe `df` with a column called `group`, how do you randomly sample `k` groups from it in dplyr? It should return all rows from `k` groups (given there are at least `k` unique values in `df$group`), and every group in `df` should be equally likely to be returned.

- Perhaps you could provide some example data? Also, see `?sample_n`. – SymbolixAU May 10 '16 at 21:57
- Iris is sufficient. The grouping variable there is species. – Big Dogg May 10 '16 at 21:59
- Using `sample_n` gives `n` randomly sampled rows per group. I'm asking for all rows from `n` randomly sampled groups. – Big Dogg May 10 '16 at 22:00
- So you want to randomly select, say, 2 of the 3 levels of Species and then return all rows for those two selected levels? – eipi10 May 10 '16 at 22:03
5 Answers
Just use `sample()` to choose some number of groups:
library(dplyr)
iris %>% filter(Species %in% sample(levels(Species), 2))

- You can use `unique(Species)` instead of `levels(Species)` if you have a character or numeric identifier column. – Dan Slone Apr 23 '21 at 04:52
- That is a nice solution, but I just ran into a problem here: I wish to sample with replacement, which is not compatible with the `%in%` statement. – Daniel Münch May 19 '21 at 12:57
- Very elegant answer. If you are looking for the reverse (n entries per id), have a look here: https://stackoverflow.com/questions/18258690/take-random-sample-by-group – Samuel Saari Aug 19 '22 at 05:23
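The two comments above (using `unique()` for non-factor columns, and sampling with replacement) can be combined in one small helper. This is only a sketch: the function name `sample_groups`, its arguments, and the added `draw` column are assumptions for illustration and are not part of the answer. It builds the result one draw at a time, so a group that is drawn twice contributes its rows twice, which `%in%` alone cannot do.

```r
library(dplyr)

# Hypothetical helper (not from the answer): sample k groups, optionally
# with replacement, and return all rows of every sampled group.
sample_groups <- function(df, group_col, k, replace = FALSE) {
  ids <- unique(df[[group_col]])
  if (!replace && k > length(ids)) {
    stop("k exceeds the number of unique groups")
  }
  # Sample positions rather than values, so a single numeric id is not
  # interpreted by sample() as the range 1:id
  picked <- ids[sample.int(length(ids), size = k, replace = replace)]
  # One piece per draw; a group drawn twice appears twice in the result
  pieces <- lapply(seq_along(picked), function(i) {
    rows <- df[df[[group_col]] %in% picked[i], , drop = FALSE]
    rows$draw <- i  # label the draw so repeated groups can be told apart
    rows
  })
  bind_rows(pieces)
}

set.seed(1)
boot <- sample_groups(iris, "Species", k = 3, replace = TRUE)
two  <- sample_groups(iris, "Species", k = 2)
```

Every group is equally likely because the sampling is done over the unique group values, not over rows.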
I think this approach makes the most sense if you are using dplyr:
iris_grouped <- iris %>%
  group_by(Species) %>%
  nest()
Which produces:
# A tibble: 3 x 2
  Species    data
  <fct>      <list>
1 setosa     <tibble [50 × 4]>
2 versicolor <tibble [50 × 4]>
3 virginica  <tibble [50 × 4]>
You can then use `sample_n`:
iris_grouped %>%
  sample_n(2)
# A tibble: 2 x 2
  Species    data
  <fct>      <list>
1 virginica  <tibble [50 × 4]>
2 versicolor <tibble [50 × 4]>

- That's great. Don't forget to `unnest()` at the end for further calculations. – Marco Dec 17 '19 at 13:28
- I really prefer this syntax, but on my large dataset this method took hours to run while @MrFlick's answer only took a second. – chakuRak Feb 07 '20 at 20:43
- This doesn't work for me on tidyverse 1.3.0, but `iris_grouped <- iris %>% nest(-Species) %>% slice_sample(n=2) %>% unnest` does. – Adam Lee Perelman Sep 09 '21 at 11:30
- Using `tidyr 1.2.0` and `dplyr 1.0.9`, neither the answer nor the suggestion by @AdamLeePerelman works. I found `iris %>% group_by(Species) %>% nest() %>% ungroup() %>% slice_sample(n=2)` worked. – Cole Robertson Aug 01 '22 at 11:57
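Pulling the comments above together, the nest approach might look like the following sketch on recent package versions. It assumes dplyr 1.0.0 or later (for `slice_sample()`) and tidyr 1.0.0 or later (for the `nest(data = ...)` form). Because the input is never grouped, no `ungroup()` is needed before sampling, and `unnest()` restores the original rows.

```r
library(dplyr)
library(tidyr)

set.seed(42)
sampled <- iris %>%
  nest(data = !Species) %>%  # one row per Species; the result is not grouped
  slice_sample(n = 2) %>%    # pick 2 of those rows, i.e. 2 whole groups
  unnest(data)               # expand back to one row per observation
```

Note that the grouping column ends up first in the result, and that nesting can be slow on large data, as one commenter reports.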
Note that on a small data frame such as iris, dplyr is considerably slower than base R subsetting (roughly 7x at the median in the benchmark below):
library(dplyr)
library(microbenchmark)
microbenchmark(
  dplyr = iris %>% filter(Species %in% sample(levels(Species), 2)),
  base  = iris[iris[["Species"]] %in% sample(levels(iris[["Species"]]), 2), ]
)
Unit: microseconds
  expr     min      lq     mean  median       uq      max neval cld
 dplyr 660.287 710.655 753.6704 722.629 771.2860 1122.527   100   b
  base  83.629  95.032 110.0936 106.057 119.1715  199.949   100  a
Note that `[[` is known to be faster than `$`, although both work.

I really like the approach described by Tristan Mahr on his blog. I've copied his function for the example below:
library(tidyverse)
sample_n_of <- function(data, size, ...) {
  dots <- quos(...)
  group_ids <- data %>%
    group_by(!!! dots) %>%
    group_indices()
  sampled_groups <- sample(unique(group_ids), size)
  data %>%
    filter(group_ids %in% sampled_groups)
}
set.seed(1234)
mpg %>%
  sample_n_of(size = 2, model)
#> # A tibble: 12 x 11
#> manufacturer model displ year cyl trans drv cty hwy fl class
#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#> 1 audi a6 qua~ 2.8 1999 6 auto(l~ 4 15 24 p midsi~
#> 2 audi a6 qua~ 3.1 2008 6 auto(s~ 4 17 25 p midsi~
#> 3 audi a6 qua~ 4.2 2008 8 auto(s~ 4 16 23 p midsi~
#> 4 ford mustang 3.8 1999 6 manual~ r 18 26 r subco~
#> 5 ford mustang 3.8 1999 6 auto(l~ r 18 25 r subco~
#> 6 ford mustang 4 2008 6 manual~ r 17 26 r subco~
#> 7 ford mustang 4 2008 6 auto(l~ r 16 24 r subco~
#> 8 ford mustang 4.6 1999 8 auto(l~ r 15 21 r subco~
#> 9 ford mustang 4.6 1999 8 manual~ r 15 22 r subco~
#> 10 ford mustang 4.6 2008 8 manual~ r 15 23 r subco~
#> 11 ford mustang 4.6 2008 8 auto(l~ r 15 22 r subco~
#> 12 ford mustang 5.4 2008 8 manual~ r 14 20 p subco~
Created on 2021-03-24 by the reprex package (v0.3.0)
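The same idea can also be written without group indices, by sampling the distinct group keys and filtering with a semi-join. The sketch below is an alternative under stated assumptions, not the function from the blog: the name `sample_n_groups` is hypothetical, the input is assumed to be ungrouped, and `slice_sample()` requires dplyr 1.0.0 or later. It uses `mtcars` so that it runs with dplyr alone.

```r
library(dplyr)

# Hypothetical variant of sample_n_of(); assumes `data` is not already grouped
sample_n_groups <- function(data, size, ...) {
  sampled_keys <- data %>%
    distinct(...) %>%        # one row per unique combination of the key columns
    slice_sample(n = size)   # every combination is equally likely
  # Keep all rows whose key combination was sampled
  semi_join(data, sampled_keys, by = names(sampled_keys))
}

set.seed(1234)
picked <- mtcars %>%
  sample_n_groups(size = 2, cyl, gear)
```

Because the keys are passed through `...`, the groups can be defined by several columns at once, here `cyl` and `gear`.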

I too had issues with Oscar's code using nest, but it worked once I updated to the latest syntax of `nest()`, `unnest()`, and `slice_sample()`.
Below is an alternate version that produces the same answers if the input frame is arranged by the group variable. Otherwise the answers will be just as good on average. This version has a couple of advantages over the nest version:
1. The final data frame has its columns in the original order; in contrast, the nest version puts the grouping variable first.
2. The intermediate results are a lot easier to read when you are debugging, since they are plain old lists.
I am interested in sampling the original number of groups with replacement, as in clustered bootstrapping. One could easily add more parameters to make the function more general.
library(tidyverse)
# function to compute a clustered bootstrap sample
samplebygroups <- function(df, groupvar) {
  # split the data into a list with one data frame per group
  datalist <- df %>%
    group_by({{ groupvar }}) %>%
    group_split()
  n <- length(datalist)
  # draw n group positions with replacement
  samplegroups <- sample(n, replace = TRUE)
  datalist[samplegroups] %>%
    bind_rows()
}
Here is a sample run:
smallcars <- mtcars %>%
  rownames_to_column(var = "Model") %>%
  tail(5) %>%
  arrange(cyl) %>%
  select(Model, cyl, mpg)
set.seed(1000)
samplebygroups(smallcars, cyl)
with output:
# A tibble: 5 x 3
  Model            cyl   mpg
  <chr>          <dbl> <dbl>
1 Ford Pantera L     8  15.8
2 Maserati Bora      8  15
3 Ferrari Dino       6  19.7
4 Ford Pantera L     8  15.8
5 Maserati Bora      8  15
You would get exactly the same rows using Oscar's code, but cyl would be the first column.
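The answer mentions that more parameters could be added to make the function more general. One possible generalisation is sketched below; the name `sample_by_groups`, the `size` and `replace` arguments, and the `draw` column are all assumptions added for illustration. Labelling each draw is often handy in a clustered bootstrap, because a group drawn twice would otherwise be indistinguishable from a single larger group.

```r
library(dplyr)

# Hypothetical generalisation of samplebygroups(); `size`, `replace`, and the
# `draw` column are additions for illustration, not part of the answer.
sample_by_groups <- function(df, groupvar, size = NULL, replace = TRUE) {
  datalist <- df %>%
    group_by({{ groupvar }}) %>%
    group_split()
  n <- length(datalist)
  if (is.null(size)) size <- n  # default: as many draws as there are groups
  idx <- sample.int(n, size = size, replace = replace)
  # Tag every draw with its position so repeated groups stay distinguishable
  pieces <- lapply(seq_along(idx), function(i) {
    mutate(datalist[[idx[i]]], draw = i)
  })
  bind_rows(pieces)
}

set.seed(1000)
boot  <- sample_by_groups(mtcars, cyl)                             # clustered bootstrap
plain <- sample_by_groups(mtcars, cyl, size = 2, replace = FALSE)  # 2 distinct groups
```

With `replace = FALSE` and `size = k`, this reduces to the original question of returning all rows from `k` equally likely groups.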
