30

From these questions (Random sample of rows from subset of an R dataframe and Sample random rows in dataframe), I can easily see how to randomly sample (select) 'n' rows from a df, or 'n' rows that originate from a specific level of a factor within a df.

Here are some sample data:

df <- data.frame(matrix(rnorm(80), nrow=40))
df$color <-  rep(c("blue", "red", "yellow", "pink"), each=10)

df[sample(nrow(df), 3), ] #samples 3 random rows from df, without replacement.

For example, to sample just 3 random rows from the 'pink' color, using library(kimisc):

library(kimisc)
sample.rows(subset(df, color == "pink"), 3)

or by writing a custom function:

sample.df <- function(df, n) df[sample(nrow(df), n), , drop = FALSE]
sample.df(subset(df, color == "pink"), 3)

However, I want to sample 3 (or n) random rows from each level of the factor. I.e. the new df would have 12 rows (3 from blue, 3 from red, 3 from yellow, 3 from pink). It's obviously possible to run this several times, create a new df for each color, and then bind them together, but I am looking for a simpler solution.

Henrik
jalapic
  • See also [How do you sample random rows within each group in a `data.table`?](https://stackoverflow.com/questions/16289182/how-do-you-sample-random-rows-within-each-group-in-a-data-table) – Henrik Aug 29 '17 at 10:45
  • Does this answer your question? [Take random sample by group](https://stackoverflow.com/questions/18258690/take-random-sample-by-group) – camille Dec 24 '21 at 15:49

5 Answers

36

In dplyr 0.3 and later, this works just fine:

df %>% group_by(color) %>% sample_n(size = 3)
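In dplyr 1.0.0 and later, sample_n() is superseded by slice_sample(). A minimal sketch of the same idea, assuming a recent dplyr is installed and reusing the question's sample data (the set.seed() call and the name sampled are additions for illustration only):

```r
library(dplyr)

set.seed(1)  # only so the illustration is reproducible
df <- data.frame(matrix(rnorm(80), nrow = 40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 10)

# Draw 3 rows per color without replacement, then drop the grouping
sampled <- df %>%
  group_by(color) %>%
  slice_sample(n = 3) %>%
  ungroup()

table(sampled$color)  # one count per color
```

To sample a fraction of each group instead of a fixed count, slice_sample() also accepts prop, e.g. slice_sample(prop = 0.3).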

Old versions of dplyr (version <= 0.2)

I set out to answer this using dplyr, assuming that this would work:

df %.% group_by(color) %.% sample_n(size = 3)

But it turns out that in 0.2 the sample_n.grouped_df S3 method exists but isn't registered in the NAMESPACE file, so it's never dispatched. Instead, I had to do this:

df %.% group_by(color) %.% dplyr:::sample_n.grouped_df(size = 3)
Source: local data frame [12 x 3]
Groups: color

            X1         X2  color
8   0.66152710 -0.7767473   blue
1  -0.70293752 -0.2372700   blue
2  -0.46691793 -0.4382669   blue
32 -0.47547565 -1.0179842   pink
31 -0.15254540 -0.6149726   pink
39  0.08135292 -0.2141423   pink
15  0.47721644 -1.5033192    red
16  1.26160230  1.1202527    red
12 -2.18431919  0.2370912    red
24  0.10493757  1.4065835 yellow
21 -0.03950873 -1.1582658 yellow
28 -2.15872261 -1.5499822 yellow

Presumably this will be fixed in a future update.

Gregor Thomas
joran
  • What version of `dplyr` are you using? Is it trunk? – momeara May 24 '14 at 00:45
  • I tried both 0.2 on cran and then installed from github; same thing. – joran May 24 '14 at 00:59
  • @joran in `dplyr 0.3` this works like a charm. It's my favorite way of doing the above problem now. – jalapic Nov 17 '14 at 04:19
  • Can anyone explain how this works conceptually? Does sample_n() look back to see if a group_by() has been applied? – axme100 Nov 06 '18 at 02:59
  • @axme100 The pipe `%>%` passes the results of each step forward to the next function, so there's no need to "look backward". Run `x <- mtcars %>% group_by(cyl)` and then start looking at `x`. You'll see that it has a new class attribute, along with many others (`attributes(x)`), so any subsequent function "knows" that it's dealing with a grouped data frame. – joran Nov 06 '18 at 03:35
  • @axme100 Then many of the other `dplyr` functions will have S3 methods specifically for `grouped_df` objects. See `methods(sample_n)`. – joran Nov 06 '18 at 03:36
  • This works well with `sample_frac` to keep the relative proportions of the classes. – Brian Stamper Apr 19 '20 at 17:34
  • This is now available as the [`slice_sample`](https://dplyr.tidyverse.org/reference/slice.html) function in `dplyr` – Maël Feb 15 '22 at 09:12
7

You can assign a random ID to each element that has a particular factor level using ave. Then you can select all random IDs in a certain range.

rndid <- with(df, ave(X1, color, FUN=function(x) {sample.int(length(x))}))
df[rndid<=3,]

This has the advantage of preserving the original row order and row names, if that's something you are interested in. Plus, you can re-use the rndid vector to create subsets of different sizes fairly easily.
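As a rough illustration of re-using the random IDs, the sketch below rebuilds the question's sample data and takes two subsets of different sizes from one rndid vector. The set.seed() call and the names sub3 and sub5 are assumptions added for the example:

```r
set.seed(42)  # only so the illustration is reproducible
df <- data.frame(matrix(rnorm(80), nrow = 40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 10)

# Within each color, assign a random permutation of 1..(group size)
rndid <- with(df, ave(X1, color, FUN = function(x) sample.int(length(x))))

sub3 <- df[rndid <= 3, ]  # 3 rows per color
sub5 <- df[rndid <= 5, ]  # 5 rows per color, a superset of sub3

# Row names keep their original order in both subsets
rownames(sub3)
```

Because both subsets come from the same IDs, the smaller sample is always nested inside the larger one, which may or may not be what you want.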

MrFlick
  • This suggestion and the other answer both work very well. May I just check two things about the above code? 1) The variable X1: does it matter which variable from the df is chosen here? (It doesn't seem to.) 2) In the situation where the number of observations varies between factor levels, and I ask for more rows per factor level than some levels contain, does this solution still work? I.e. if I ask for 11 rows per color, it will return 10. This may be useful in my real data, where obs/rows per factor level do vary. – jalapic May 23 '14 at 15:22
  • @jalapic 1) You are correct in that it doesn't really matter which variable you pass as the first parameter. Passing a numeric vector helped to keep the result numeric. 2) If you ask for 10 rows (`rndid<=10`) and a group only has 3, all three rows for that group will be returned and no missing values will be introduced nor will sampling be done with replacement. So you may wind up with unbalanced groups. – MrFlick May 23 '14 at 15:27
  • thank you. I don't mind about the unbalanced groups in this context, so that works perfectly. – jalapic May 23 '14 at 15:34
  • @MrFlick, I want to satisfy the sample size condition of the chi-square test, so I need to sample at least 5 cases in each group. How can I do this using your solution? – Saeed Zhiany Jun 05 '18 at 06:53
7

I would consider my stratified function, which is presently hosted as a GitHub Gist.

Get it with:

library(devtools)  ## To download "stratified"
source_gist("https://gist.github.com/mrdwab/6424112")

And use it with:

stratified(df, "color", 3)

There are several features that are convenient for stratified sampling. For instance, you can restrict the sample to particular strata "on the fly":

stratified(df, "color", 3, select = list(color = c("blue", "red")))

To give you a sense of what the function does, here are the arguments to stratified:

  • df: The input data.frame
  • group: A character vector of the column or columns that make up the "strata".
  • size: The desired sample size.
    • If size is a value less than 1, a proportionate sample is taken from each stratum.
    • If size is a single integer of 1 or more, that number of samples is taken from each stratum.
    • If size is a vector of integers, the specified number of samples is taken for each stratum. It is recommended that you use a named vector. For example, if you have two strata, "A" and "B", and you wanted 5 samples from "A" and 10 from "B", you would enter size = c(A = 5, B = 10).
  • select: This allows you to subset the groups in the sampling process. This is a list. For instance, if your group variable was "Group", and it contained three strata, "A", "B", and "C", but you only wanted to sample from "A" and "C", you can use select = list(Group = c("A", "C")).
  • replace: For sampling with replacement.
A5C1D2H2I1M1N2O1R2T1
  • this is a really neat function - very useful – jalapic Sep 06 '14 at 17:40
  • Nice and helpful. It appears that in some versions there is a bug in the source_gist function, which raises an error. I used a workaround like this: `source_gist("https://gist.github.com/mrdwab/6424112", filename = "stratified.R")` – soungalo Feb 19 '18 at 14:09
6

Here's a solution. We split a data.frame into color groups. Then we sample 3 rows from each group. This yields a list of data.frames.

df2 <- lapply(split(df, df$color),
   function(subdf) subdf[sample(nrow(subdf), 3), , drop = FALSE]
)

To obtain the desired result, we merge the list of data.frames into 1 data.frame:

do.call('rbind', df2)
##                    X1          X2  color
## blue.3    -1.22677188  1.25648082   blue
## blue.4    -0.54516686 -1.94342967   blue
## blue.1     0.44647071  0.16283326   blue
## pink.40    0.23520296 -0.40411906   pink
## pink.34    0.02033939 -0.32321309   pink
## pink.33   -1.01790533 -1.22618575   pink
## red.16     1.86545895  1.11691250    red
## red.11     1.35748078 -0.36044728    red
## red.13    -0.02425645  0.85335279    red
## yellow.21  1.96728782 -1.81388110 yellow
## yellow.25 -0.48084967  0.07865186 yellow
## yellow.24 -0.07056236 -0.28514125 yellow
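The split-and-bind steps above can be wrapped in a small helper. The sketch below is one possible version, not part of the original answer: the name sample_by_group and its arguments are hypothetical, and capping the draw at the group size is an added assumption so that small groups return all of their rows instead of raising an error.

```r
set.seed(7)  # only so the illustration is reproducible
df <- data.frame(matrix(rnorm(80), nrow = 40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 10)

# Hypothetical helper: sample up to n rows from each level of `group`
sample_by_group <- function(data, group, n) {
  pieces <- lapply(split(data, data[[group]]), function(subdf) {
    k <- min(n, nrow(subdf))  # never ask for more rows than exist
    subdf[sample(nrow(subdf), k), , drop = FALSE]
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL  # drop the "blue.3"-style row names
  out
}

res3  <- sample_by_group(df, "color", 3)   # 3 rows per color
res11 <- sample_by_group(df, "color", 11)  # capped at 10 rows per color
```

Using sample(nrow(subdf), k) rather than sampling from a vector of row indices also avoids the surprise that sample(x, ...) treats a single number x as 1:x.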
gagolews
0

Here is a way, in base, that allows for multiple groups and sampling with replacement:

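n <- 3                        # rows to draw per group
resample <- TRUE              # TRUE samples with replacement
index <- seq_len(nrow(df))    # row positions to sample from
fun <- function(x) sample(x, n, replace = resample)
a <- aggregate(index, by = list(group = df$color), FUN = fun)

df[c(a$x), ]                  # a$x holds the sampled row positions (a matrix when n > 1)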

To add another group, include it in the 'by' argument to aggregate.
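For instance, with a second grouping column the call might look like the sketch below. The size column is a made-up factor added purely for illustration, as are the set.seed() call and the name picked:

```r
set.seed(1)  # only so the illustration is reproducible
df <- data.frame(matrix(rnorm(80), nrow = 40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 10)
df$size <- rep(c("small", "large"), times = 20)  # hypothetical second factor

n <- 2
resample <- TRUE
index <- seq_len(nrow(df))
fun <- function(x) sample(x, n, replace = resample)

# One row of `a` per color/size combination; a$x is a matrix with n columns
a <- aggregate(index, by = list(color = df$color, size = df$size), FUN = fun)

picked <- df[c(a$x), ]
table(picked$color, picked$size)  # counts per color/size combination
```

Note that sample(x, n) treats a length-one numeric x as 1:x, so this approach assumes every group contains more than one row.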

user3357177