
If I want to randomly select some samples from different groups, I use the plyr package and the code below:

require(plyr)
sampleGroup<-function(df,size) {
  df[sample(nrow(df),size=size),]
}

iris.sample<-ddply(iris,.(Species),function(df) sampleGroup(df,10))

Here 10 samples are selected from each species.

Some of my data frames are very big, so my question is: can I use the same sampleGroup function with the dplyr package? Or is there another way to do the same thing in dplyr?

EDIT

Version 0.2 of the dplyr package introduced two new functions to select random rows from a table: sample_n() and sample_frac().
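
For example, a minimal sketch of both functions on the iris data (assuming dplyr >= 0.2):

library(dplyr)

# a fixed number of rows per group
iris %>% group_by(Species) %>% sample_n(10)

# a fraction of each group's rows
iris %>% group_by(Species) %>% sample_frac(0.1)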

smci
Robert
  • Here is a link to a dplyr intro: http://rpubs.com/hadley/dplyr-intro – marbel Jan 21 '14 at 12:26
  • Thanks, but I think the solution to this problem is not in the documentation yet. Nice solution with data.table though! – Robert Jan 21 '14 at 12:38
  • 1
    Why not simply using `iris %.% group_by(Species) %.% sampleGroup(size = 10)` – dickoa Jan 21 '14 at 16:16
  • 2
    I don't think there's a natural pure dplyr solution, but sampling seems sufficiently important that it should be a top-level function: https://github.com/hadley/dplyr/issues/202 – hadley Jan 21 '14 at 16:22
  • @Robert I'm not sure how I missed that in your question; it is quite clearly stated. Deleting my comment. – Brian Diggs Jan 21 '14 at 19:28
  • Great that @hadley wants to add a sample function to the dplyr package. I found a solution using only dplyr functions but it is very slow: `system.time(rbind_all(do(testdata %.% group_by(group),function(x) sampleGroup(x,10))))` @Troy's solution for dplyr is much faster. – Robert Jan 24 '14 at 08:29

4 Answers


Yes, you can use dplyr:

mtcars %>% 
    group_by(cyl) %>%
    slice_sample(n = 2)

and the results look like this:

Source: local data frame [6 x 11]
Groups: cyl

   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
3 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
4 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
5 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
6 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8

Historical note: slice_sample() replaces sample_n() in dplyr 1.0.0 (May 2020). Early versions of dplyr required do(sample_n(., 2)).
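
For completeness, a sketch of the proportional variant: slice_sample(prop = ) replaces sample_frac() in the same way.

mtcars %>% 
    group_by(cyl) %>%
    slice_sample(prop = 0.25)  # roughly a quarter of each cyl group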

Todd West
PhilChang
  • @Arun, yes, but you should update dplyr to the newest version, 0.1.3.0.99. – PhilChang Apr 23 '14 at 00:18
  • @Arun, sorry, you should use sample_n() – PhilChang Apr 25 '14 at 02:25
  • Is there a way to do this without using `do`? – Brani Jul 11 '14 at 06:31
  • 3
    Can you clock your stuff against the `data.table` solutions above? I stay in `dplyr` as much as I can because the grammar is easier (or at least I haven't learned `data.table` yet). It kind of drives me crazy that every `dplyr` question on SO gets a `data.table` answer, so I would like to see if this new code gets close. – gregmacfarlane Jul 31 '14 at 18:54
  • @gregmacfarlane Just read the comments above and it will make sense. There wasn't an acceptable way to do this with `dplyr` at the time. After reading the docs current at the time, the OP answered: "Thanks, but I think the solution to this problem is not in the documentation yet. Nice solution with data.table though! – Robert". Also read the other answers from the time the question was asked; they don't look like amazing solutions... – marbel Feb 17 '16 at 03:12
  • @PhilChang I get this error message when I run the following code: clickers %>% group_by(ListName)%>% sample_n(200) Error: `size` must be less or equal than 29 (size of data), set `replace` = TRUE to use sampling with replacement – user3614783 Apr 09 '18 at 17:08
  • @user3614783, use `sample_n(min(n(), 200))`. The problem is that some of your groups are not 200 rows long. – Bastien Mar 18 '19 at 15:12
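
A side note on the group-size error in the last two comments: unlike sample_n(), the newer slice_sample() silently truncates n to the group size, so the min(n(), ...) guard is not needed there. A minimal sketch:

library(dplyr)

# n = 200 exceeds every cyl group's size; slice_sample() simply returns each group in full
mtcars %>% 
    group_by(cyl) %>% 
    slice_sample(n = 200)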

This is easy to do with data.table, and useful for a big table.

NOTE: As mentioned in the comments by Troy, there is a more efficient way of doing this using data.table (sketched after the iris example below), but I wanted to respect the OP's sample function and format in the answer.

require(data.table)
DT <- data.table(x = rnorm(10e6, 100, 50), y = letters)

sampleGroup<-function(df,size) {
  df[sample(nrow(df),size=size),]
}

result <- DT[, sampleGroup(.SD, 10), by=y]
print(result)

# y         x
# 1: a  30.11659
# 2: a  57.99974
# 3: a  58.13634
# 4: a  87.28466
# 5: a  85.54986
# ---
# 256: z 149.85817
# 257: z 160.24293
# 258: z  26.63071
# 259: z  17.00083
# 260: z 130.27796

system.time(DT[, sampleGroup(.SD, 10), by=y])
# user  system elapsed 
# 0.66    0.02    0.69 

Using the iris dataset:

iris <- data.table(iris)
iris[, sampleGroup(.SD, 10), by = Species]
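
For reference, the faster variant Troy describes in the comments below: build the sampled row indices with .I, then subset once (iris_dt is just the data.table copy of iris).

iris_dt <- data.table(iris)
iris_dt[iris_dt[, list(idx = sample(.I, 10)), by = "Species"]$idx]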
marbel
  • 2
    +1 for data.table. Using `.I` doubles the performance speed: `iris[iris[,list(idx=sample(.I,10)),by="Species"]$idx]` – Troy Jan 21 '14 at 13:08
  • 1
    I think you want `sampleGroup(.SD, 10)` (note `.SD` instead of `DT`) – eddi Jan 22 '14 at 17:11

That's a good question! I can't see any easy way to do it with the documented syntax for dplyr, but how about this for a workaround?

sampleGroup <- function(df, x = 1) {
  df[unlist(lapply(attr(df, "indices"),
                   function(r) sample(r, min(length(r), x)))), ]
}

sampleGroup(iris %.% group_by(Species), 3)

#Source: local data frame [9 x 5]
#Groups: Species
#
#    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#39           4.4         3.0          1.3         0.2     setosa
#16           5.7         4.4          1.5         0.4     setosa
#25           4.8         3.4          1.9         0.2     setosa
#51           7.0         3.2          4.7         1.4 versicolor
#62           5.9         3.0          4.2         1.5 versicolor
#59           6.6         2.9          4.6         1.3 versicolor
#148          6.5         3.0          5.2         2.0  virginica
#103          7.1         3.0          5.9         2.1  virginica
#120          6.0         2.2          5.0         1.5  virginica

EDIT - PERFORMANCE COMPARISON

Here's a test against data.table (both native and with a function call, as per the example) for 1m rows, 26 groups.

Native data.table is about 2x as fast as the dplyr workaround, and also faster than the data.table call with an external function. So dplyr and data.table with a function call probably have about the same performance.

Hopefully the dplyr guys will give us some native syntax for sampling soon! (or even better, maybe it's already there)

sampleGroup.dt<-function(df,size) {
  df[sample(nrow(df),size=size),]
}

testdata<-data.frame(group=sample(letters,10e5,T),runif(10e5))

dti<-data.table(testdata)

# using the dplyr workaround with external function call
system.time(sampleGroup(testdata %.% group_by(group),10))
#user  system elapsed 
#0.07    0.00    0.06 

#using native data.table
system.time(dti[dti[,list(val=sample(.I,10)),by="group"]$val])
#user  system elapsed 
#0.04    0.00    0.03 

#using data.table with external function call
system.time(dti[, sampleGroup.dt(dti, 10), by=group])
#user  system elapsed 
#0.06    0.02    0.08 
Troy
  • +1 for Troy's answer using data.table in the right way. My answer is probably slower because it copies the table twice. – marbel Jan 21 '14 at 13:32
  • 1
    +1 very nice comparisons. But I don't understand your reason for the last benchmark? You're sample the whole data for 10 elements for every group. Whereas you're doing something with `attributes` for `dplyr` case.. Why not benchmark the same for `dplyr` with a function similar to the 3rd case for `DT` as well? – Arun Jan 21 '14 at 14:58
  • 3
    Also, an important aspect of benchmarking is to see how well it **scales**. With just 26 groups to aggregate by, there'll be no real difference one can detect. Change your line to `testdata<-data.frame(group=sample(paste("id", 1:1e5, sep=""),10e5,T),runif(10e5))` and run your benchmarks again – Arun Jan 21 '14 at 14:59
  • Please note that the internals of dplyr (e.g. the `indices` attributes) are likely to evolve. Don't rely on their structure. – Romain Francois Jan 30 '14 at 11:07
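
Following Arun's scaling point, a sketch of the benchmark with ~1e5 groups instead of 26; the min(.N, ...) guard handles groups that end up with fewer than 10 rows:

testdata <- data.frame(group = sample(paste0("id", 1:1e5), 10e5, TRUE), val = runif(10e5))
dti <- data.table(testdata)

# sample up to 10 row indices per group, then subset once
system.time(dti[dti[, list(idx = .I[sample(.N, min(.N, 10))]), by = "group"]$idx])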

dplyr 1.0.2 can subset with various slice verbs now (https://dplyr.tidyverse.org/reference/slice.html), including random sampling via slice_sample():

mtcars %>% 
  slice_sample(n = 10)

and add a group_by() to sample within each category:

mtcars %>% 
  group_by(cyl) %>% 
  slice_sample(n = 2)
Zoë Turner
  • Hi @zoë-turner, I wonder if you know how to set seed for the `slice_sample`? See my question: https://stackoverflow.com/questions/75751139/how-to-resample-the-rows-by-group-with-the-same-seed-in-r – WenliL Mar 16 '23 at 02:46
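
Re: the seed question in the comment above, slice_sample() draws from R's global random number generator, so a set.seed() call immediately beforehand makes the draw reproducible. A minimal sketch:

library(dplyr)

set.seed(123)  # fix the RNG state so the same rows are drawn on every run
mtcars %>% 
  group_by(cyl) %>% 
  slice_sample(n = 2)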