
If I want to randomly select some samples from different groups, I use the plyr package and the code below:

require(plyr)
sampleGroup<-function(df,size) {
  df[sample(nrow(df),size=size),]
}

iris.sample<-ddply(iris,.(Species),function(df) sampleGroup(df,10))

Here 10 samples are selected from each species.

Some of my data frames are very big, so my question is: can I use the same sampleGroup function with the dplyr package? Or is there another way to do the same thing in dplyr?

EDIT

Version 0.2 of the dplyr package introduced two new functions to select random rows from a table: sample_n() and sample_frac().
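
For example, a minimal sketch of both functions on the iris data (assuming dplyr >= 0.2):

library(dplyr)

# a fixed number of rows per group
iris %>% group_by(Species) %>% sample_n(10)

# a fraction of each group's rows
iris %>% group_by(Species) %>% sample_frac(0.1)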

smci
Robert
  • Here is a link to a dplyr intro: http://rpubs.com/hadley/dplyr-intro – marbel Jan 21 '14 at 12:26
  • Thanks, but I think the solution to this problem is not in the documentation yet. Nice solution with data.table though! – Robert Jan 21 '14 at 12:38
  • 1
    Why not simply using `iris %.% group_by(Species) %.% sampleGroup(size = 10)` – dickoa Jan 21 '14 at 16:16
  • 2
    I don't think there's a natural pure dplyr solution, but sampling seems sufficiently important that it should be a top-level function: https://github.com/hadley/dplyr/issues/202 – hadley Jan 21 '14 at 16:22
  • @Robert I'm not sure how I missed that in your question; it is quite clearly stated. Deleting my comment. – Brian Diggs Jan 21 '14 at 19:28
  • Great that @hadley wants to add a sample function to the dplyr package. I found a solution using only dplyr functions but it is very slow: `system.time(rbind_all(do(testdata %.% group_by(group),function(x) sampleGroup(x,10))))` @Troy's solution for dplyr is much faster. – Robert Jan 24 '14 at 08:29

4 Answers


Yes, you can use dplyr:

mtcars %>% 
    group_by(cyl) %>%
    slice_sample(n = 2)

and the results look like this:

Source: local data frame [6 x 11]
Groups: cyl

   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
3 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
4 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
5 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
6 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8

Historical note: slice_sample() replaces sample_n() in dplyr 1.0.0 (May 2020). Early versions of dplyr required do(sample_n(., 2)).
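
For completeness, a sketch of the proportional variant: slice_sample(prop = ) replaces sample_frac() in the same way.

mtcars %>% 
    group_by(cyl) %>%
    slice_sample(prop = 0.25)  # roughly a quarter of each cyl group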

Todd West
PhilChang
  • @Arun, yes, but you should update dplyr to the newest version, 0.1.3.0.99. – PhilChang Apr 23 '14 at 00:18
  • @Arun, sorry, you should use sample_n() – PhilChang Apr 25 '14 at 02:25
  • Is there a way to do this without using `do`? – Brani Jul 11 '14 at 06:31
  • 3
    Can you clock your stuff against the `data.table` solutions above? I stay in `dplyr` as much as I can because the grammar is easier (or at least I haven't learned `data.table` yet). It kind of drives me crazy that every `dplyr` question on SO gets a `data.table` answer, so I would like to see if this new code gets close. – gregmacfarlane Jul 31 '14 at 18:54
  • @gregmacfarlane Just read the comments above and it will make sense. There wasn't an acceptable way to do this with `dplyr` at the time. After reading the docs current at the time, the OP answered: "Thanks, but I think the solution to this problem is not in the documentation yet. Nice solution with data.table though! – Robert". Also read the other answers from the time the question was asked; they don't look like amazing solutions... – marbel Feb 17 '16 at 03:12
  • @PhilChang I get this error message when I run the following code: clickers %>% group_by(ListName)%>% sample_n(200) Error: `size` must be less or equal than 29 (size of data), set `replace` = TRUE to use sampling with replacement – user3614783 Apr 09 '18 at 17:08
  • @user3614783, use `sample_n(min(n(), 200))`. The problem is that some of your groups are not 200 rows long. – Bastien Mar 18 '19 at 15:12
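
A side note on the group-size error in the last two comments: unlike sample_n(), the newer slice_sample() silently truncates n to the group size, so the min(n(), ...) guard is not needed there. A minimal sketch:

library(dplyr)

# n = 200 exceeds every cyl group's size; slice_sample() simply returns each group in full
mtcars %>% 
    group_by(cyl) %>% 
    slice_sample(n = 200)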

This is easy to do with data.table, and useful for a big table.

NOTE: As mentioned in the comments by Troy, there is a more efficient way of doing this using data.table (sketched after the iris example below), but I wanted to respect the OP's sample function and format in the answer.

require(data.table)
DT <- data.table(x = rnorm(10e6, 100, 50), y = letters)

sampleGroup<-function(df,size) {
  df[sample(nrow(df),size=size),]
}

result <- DT[, sampleGroup(.SD, 10), by=y]
print(result)

# y         x
# 1: a  30.11659
# 2: a  57.99974
# 3: a  58.13634
# 4: a  87.28466
# 5: a  85.54986
# ---
# 256: z 149.85817
# 257: z 160.24293
# 258: z  26.63071
# 259: z  17.00083
# 260: z 130.27796

system.time(DT[, sampleGroup(.SD, 10), by=y])
# user  system elapsed 
# 0.66    0.02    0.69 

Using the iris dataset:

iris <- data.table(iris)
iris[, sampleGroup(.SD, 10), by = Species]
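
For reference, the faster variant Troy describes in the comments below: build the sampled row indices with .I, then subset once (iris_dt is just the data.table copy of iris).

iris_dt <- data.table(iris)
iris_dt[iris_dt[, list(idx = sample(.I, 10)), by = "Species"]$idx]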
marbel
  • 2
    +1 for data.table. Using `.I` doubles the performance speed: `iris[iris[,list(idx=sample(.I,10)),by="Species"]$idx]` – Troy Jan 21 '14 at 13:08
  • 1
    I think you want `sampleGroup(.SD, 10)` (note `.SD` instead of `DT`) – eddi Jan 22 '14 at 17:11

That's a good question! I can't see any easy way to do it with the documented syntax for dplyr, but how about this for a workaround?

sampleGroup <- function(df, x = 1) {
  df[unlist(lapply(attr(df, "indices"),
                   function(r) sample(r, min(length(r), x)))), ]
}

sampleGroup(iris %.% group_by(Species), 3)

#Source: local data frame [9 x 5]
#Groups: Species
#
#    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#39           4.4         3.0          1.3         0.2     setosa
#16           5.7         4.4          1.5         0.4     setosa
#25           4.8         3.4          1.9         0.2     setosa
#51           7.0         3.2          4.7         1.4 versicolor
#62           5.9         3.0          4.2         1.5 versicolor
#59           6.6         2.9          4.6         1.3 versicolor
#148          6.5         3.0          5.2         2.0  virginica
#103          7.1         3.0          5.9         2.1  virginica
#120          6.0         2.2          5.0         1.5  virginica

EDIT - PERFORMANCE COMPARISON

Here's a test against data.table (both native and with a function call, as per the example) for 1m rows, 26 groups.

Native data.table is about 2x as fast as the dplyr workaround, and also faster than the data.table call with an external function. So dplyr and data.table with a function call probably have about the same performance.

Hopefully the dplyr guys will give us some native syntax for sampling soon! (or even better, maybe it's already there)

sampleGroup.dt<-function(df,size) {
  df[sample(nrow(df),size=size),]
}

testdata<-data.frame(group=sample(letters,10e5,T),runif(10e5))

dti<-data.table(testdata)

# using the dplyr workaround with external function call
system.time(sampleGroup(testdata %.% group_by(group),10))
#user  system elapsed 
#0.07    0.00    0.06 

#using native data.table
system.time(dti[dti[,list(val=sample(.I,10)),by="group"]$val])
#user  system elapsed 
#0.04    0.00    0.03 

#using data.table with external function call
system.time(dti[, sampleGroup.dt(dti, 10), by=group])
#user  system elapsed 
#0.06    0.02    0.08 
Troy
  • +1 for Troy's answer using data.table in the right way. My answer is probably slower because it copies the table twice. – marbel Jan 21 '14 at 13:32
  • 1
    +1 very nice comparisons. But I don't understand your reason for the last benchmark? You're sample the whole data for 10 elements for every group. Whereas you're doing something with `attributes` for `dplyr` case.. Why not benchmark the same for `dplyr` with a function similar to the 3rd case for `DT` as well? – Arun Jan 21 '14 at 14:58
  • 3
    Also, an important aspect of benchmarking is to see how well it **scales**. With just 26 groups to aggregate by, there'll be no real difference one can detect. Change your line to `testdata<-data.frame(group=sample(paste("id", 1:1e5, sep=""),10e5,T),runif(10e5))` and run your benchmarks again – Arun Jan 21 '14 at 14:59
  • Please note that the internals of dplyr (e.g. the `indices` attributes) are likely to evolve. Don't rely on their structure. – Romain Francois Jan 30 '14 at 11:07
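
Following Arun's scaling point, a sketch of the benchmark with ~1e5 groups instead of 26; the min(.N, ...) guard handles groups that end up with fewer than 10 rows:

testdata <- data.frame(group = sample(paste0("id", 1:1e5), 10e5, TRUE), val = runif(10e5))
dti <- data.table(testdata)

# sample up to 10 row indices per group, then subset once
system.time(dti[dti[, list(idx = .I[sample(.N, min(.N, 10))]), by = "group"]$idx])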

dplyr 1.0.2 can subset with various slice verbs now (https://dplyr.tidyverse.org/reference/slice.html), including random sampling via slice_sample():

mtcars %>% 
  slice_sample(n = 10)

and add a group_by() to sample within each category:

mtcars %>% 
  group_by(cyl) %>% 
  slice_sample(n = 2)
Zoë Turner
  • Hi @zoë-turner, I wonder if you know how to set seed for the `slice_sample`? See my question: https://stackoverflow.com/questions/75751139/how-to-resample-the-rows-by-group-with-the-same-seed-in-r – WenliL Mar 16 '23 at 02:46
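
Re: the seed question in the comment above, slice_sample() draws from R's global random number generator, so a set.seed() call immediately beforehand makes the draw reproducible. A minimal sketch:

library(dplyr)

set.seed(123)  # fix the RNG state so the same rows are drawn on every run
mtcars %>% 
  group_by(cyl) %>% 
  slice_sample(n = 2)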