Creating a random sample from a dataframe with a nested structure

Question

This question builds on the SO post found here.

I am trying to extract a random sample of rows in a data frame using a nesting condition.

Using the following dummy dataset (modified from iris):

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          5.3         2.9          1.5         0.2  setosa
5          5.2         3.7          1.3         0.2  virginica
6          4.7         3.2          1.5         0.2  virginica
7          3.9         3.1          1.4         0.2  virginica
8          4.7         3.2          1.3         0.2  virginica
9          4.0         3.1          1.5         0.2  versicolor
10         5.0         3.6          1.4         0.2  versicolor
11         4.6         3.1          1.5         0.2  versicolor
12         5.0         3.6          1.5         0.2  versicolor

The code below works fine to take a simple sample of 2 rows:

iris[sample(nrow(iris), 2), ]

However, I would like to take a sample of 2 rows for each level of a specific variable. For example, create a random sample of 2 rows for each level of the variable 'Species', like this:

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
4          5.3         2.9          1.5         0.2  setosa
6          4.7         3.2          1.5         0.2  virginica
7          3.9         3.1          1.4         0.2  virginica
11         4.6         3.1          1.5         0.2  versicolor
12         5.0         3.6          1.5         0.2  versicolor

Thanks for your help!

Never mind, I just found the answer to my question here: http://stackoverflow.com/questions/23831711/selecting-n-random-rows-across-all-levels-of-a-factor-within-a-dataframe?rq=1 – Aurelie Calabrese, Mar 17 '15 at 21:47

Gregor Thomas · Accepted Answer · 2015-03-17T22:01:27.187

Very easy with dplyr:

library(dplyr)
iris %>%
    group_by(Species) %>%
    sample_n(size = 2)

#   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
# 1          4.6         3.4          1.4         0.3     setosa
# 2          5.2         3.5          1.5         0.2     setosa
# 3          6.5         2.8          4.6         1.5 versicolor
# 4          5.7         2.8          4.5         1.3 versicolor
# 5          5.8         2.8          5.1         2.4  virginica
# 6          7.7         2.6          6.9         2.3  virginica
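
In dplyr 1.0.0 and later, sample_n() is superseded by slice_sample(), although sample_n() still works. A minimal sketch of the same per-group draw, assuming dplyr >= 1.0.0 is installed; the set.seed() call and the name `sampled` are illustrative additions, not part of the original answer:

```r
library(dplyr)

set.seed(42)  # assumption: any fixed seed, used only to make the draw repeatable

sampled <- iris %>%
    group_by(Species) %>%
    slice_sample(n = 2) %>%   # 2 random rows within each Species group
    ungroup()                 # drop the grouping so later steps see a plain table

# Check that every level of Species contributes exactly 2 rows
count(sampled, Species)
```

Calling ungroup() at the end is optional; it only prevents the grouping from silently carrying over into later operations.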

You can group by as many columns as you'd like:

CO2 %>% group_by(Type, Treatment) %>% sample_n(size = 2)
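
If dplyr is not available, the same nested sampling can be sketched in base R. This is a different technique from the one above (split() plus lapply() instead of group_by()), and the names `co2`, `groups`, `picked` and `co2_sample` are hypothetical. It assumes every Type/Treatment combination has at least 2 rows:

```r
set.seed(1)  # assumption: any fixed seed, used only to make the draw repeatable

co2 <- as.data.frame(CO2)  # drop the extra groupedData classes of the built-in CO2

# Row indices for each Type x Treatment combination (empty combinations dropped)
groups <- split(seq_len(nrow(co2)), list(co2$Type, co2$Treatment), drop = TRUE)

# Draw 2 row indices per combination; sample.int() avoids the pitfall of
# sample() treating a single number n as the range 1:n
picked <- unlist(
    lapply(groups, function(idx) idx[sample.int(length(idx), 2)]),
    use.names = FALSE
)

co2_sample <- co2[picked, ]

# Check that each combination contributes exactly 2 rows
table(co2_sample$Type, co2_sample$Treatment)
```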

edited Mar 17 '15 at 22:01

answered Mar 17 '15 at 21:44

Thanks Gregor! Very elegant ;-) What if I have another variable nested within Species? Is there a way to specify it in the group_by argument? – Aurelie Calabrese Mar 17 '15 at 21:55
Yes, see edits. You can group by as many variables as you would like. – Gregor Thomas Mar 16 '16 at 16:03
