41

How would you use data.table to efficiently take a sample of rows within each group in a data frame?

DT = data.table(a = sample(1:2), b = sample(1:1000,20))
DT
    a   b
 1: 2 562
 2: 1 183
 3: 2 180
 4: 1 874
 5: 2 533
 6: 1  21
 7: 2  57
 8: 1  20
 9: 2  39
10: 1 948
11: 2 799
12: 1 893
13: 2 993
14: 1  69
15: 2 906
16: 1 347
17: 2 969
18: 1 130
19: 2 118
20: 1 732

I was thinking of something like: DT[ , sample(??, 3), by = a] that would return a sample of three rows for each "a" (the order of the returned rows isn't significant):

    a   b
 1: 2 180
 2: 2  57
 3: 2 799
 4: 1  69
 5: 1 347
 6: 1 732
pogibas
  • 27,303
  • 19
  • 84
  • 117
Christopher Manning
  • 4,527
  • 2
  • 27
  • 36

5 Answers5

62

Maybe something like this?

> DT[,.SD[sample(.N, min(3,.N))],by = a]
   a   b
1: 1 744
2: 1 497
3: 1 167
4: 2 888
5: 2 950
6: 2 343

(Thanks to Josh for the correction, below.)

Josh O'Brien
  • 159,210
  • 26
  • 366
  • 455
joran
  • 169,992
  • 32
  • 429
  • 468
  • 1
    This is identical in this case: `DT[, sample(b, 3), by=a]` (except for names) – GSee Apr 29 '13 at 22:30
  • @GSee good eye. That would work in my apparently oversimplified example, but in practice, I want to return the whole row like joran's answer does. – Christopher Manning Apr 29 '13 at 22:35
  • 16
    @ChristopherManning -- I'm just guessing, but (if it really matters), in some cases this might be faster: `DT[DT[, sample(.I, 3), by=a][[2]],]`. (`sample(.I,3)` samples row numbers relative to `DT`). This call's advantage is that it doesn't require that each subset `.SD` be entirely populated as the call is processed. – Josh O'Brien Apr 29 '13 at 22:41
  • 2
    Thanks for the help! I also found out that I can use an `if` statement to conditionally change the number of samples returned for each group based on the value of `a`: `DT[,.SD[sample(.N, if(a == 1) 2 else 3)],by = a]` – Christopher Manning Apr 29 '13 at 22:52
  • 1
    @JoshO'Brien , I’m trying to understand some behavior of your solution. I think applying `sample` on `.I` when it happens that the grouping yields only one row might give unexpected results. In such cases `.I` contains only an integer (the row position in DT) and `sample` behaves differently than when fed with a vector. The soltunion `DT[DT[ , .I[sample(.N,3)] , by = a]$V1]` as proposed [here](https://stackoverflow.com/questions/33887083/from-data-table-randomly-select-one-row-per-group) by @akrun might fix it. – Valentin_Ștefan Sep 10 '17 at 11:01
  • 1
    @Valentine Good point. Your comment also made me notice that there was a flaw in the original answer, which would fail for any groups with less than three rows. (Try `sample(2,3)` to see what I mean.) I've fixed it now, using a fix that should also presumably be applied to your proposed solution. Thanks for your comment! – Josh O'Brien Sep 10 '17 at 18:41
  • Hi Joran. Sorry -- somehow I went ahead and edited your answer, thinking it had been my own, and only just now noticed that it was not! Am leaving the edit in place for now, though, so that you can keep or roll back as you see fit... – Josh O'Brien Sep 10 '17 at 18:46
  • @JoshO'Brien No problem! – joran Sep 10 '17 at 19:37
8

I believe joran's answer can be further generalized. The details are here (How do you sample groups in a data.table with a caveat) but I believe this solution accounts for cases where there aren't "3" rows to sample from.

The current solution will error out when it tries to sample "x" times from rows that have less than "x" common values. In the below case, x=3. And it takes into consideration this caveat. (Solution done by nrussell)

set.seed(123)
##
DT <- data.table(
  a=c(1,1,1,1:15,1,1), 
  b=sample(1:1000,20))
##
R> DT[,.SD[sample(.N,min(.N,3))],by = a]
     a   b
 1:  1 288
 2:  1 881
 3:  1 409
 4:  2 937
 5:  3  46
 6:  4 525
 7:  5 887
 8:  6 548
 9:  7 453
10:  8 948
11:  9 449
12: 10 670
13: 11 566
14: 12 102
15: 13 993
16: 14 243
17: 15  42
Community
  • 1
  • 1
road_to_quantdom
  • 1,341
  • 1
  • 13
  • 20
4

There are two subtle considerations that impact the answer to this question, and these are mentioned by Josh O'Brien and Valentin in comments. The first is that subsetting via .SD is very inefficient, and it is better to sample .I directly (see the benchmark below).

The second consideration, if we do sample from .I, is that calling sample(.I, size = 1) leads to unexpected behavior when .I > 1 and length(.I) = 1. In this case, sample() behaves as if we called sample(1:.I, size = 1), which is surely not what we want. As Valentin notes, it's better to use the construct .I[sample(.N, size = 1)] in this case.

As a benchmark, we build a simple 1,000 x 1 data.table and sample randomly per group. Even with such a small data.table the .I method is roughly 20x faster.

library(microbenchmark)
library(data.table)

set.seed(1L)
DT <- data.table(id = sample(1e3, 1e3, replace = TRUE))

microbenchmark(
  `.I` = DT[DT[, .I[sample(.N, 1)], by = id][[2]]],
  `.SD` = DT[, .SD[sample(.N, 1)], by = id]
)
#> Unit: milliseconds
#>  expr       min        lq     mean    median        uq       max neval
#>    .I  2.396166  2.588275  3.22504  2.794152  3.118135  19.73236   100
#>   .SD 55.798177 59.152000 63.72131 61.213650 64.205399 102.26781   100

Created on 2020-12-02 by the reprex package (v0.3.0)

tomshafer
  • 795
  • 3
  • 8
  • 12
2

Inspired by this answer by David Arenburg, another method to avoid the .SD allocation would be to sample the groups, then join back onto the original data using .EACHI

DT[ DT[, sample(.N, 3), by=a], b[i.V1], on="a", by=.EACHI]

#    a  V1
# 1: 2  42
# 2: 2 498
# 3: 2 179
# 4: 1 469
# 5: 1  93
# 6: 1 898

where the DT[, sample(.N, 3), by=a] line gives us a sample for each group

#         a V1
# 1:      1  9
# 2:      1  3
# 3:      1  2
# 4:      2  4
# 5:      2  9
# ---          

so we can then use V1 to give us the b it corresponds to.

Community
  • 1
  • 1
SymbolixAU
  • 25,502
  • 4
  • 67
  • 139
0

Stratified sampling > oversampling

size=don[y==1,.(strata=length(iden)),by=.(y,x)] # count of iden by strata   
table(don$x,don$y) 

don<-merge(don,size[,.(y,strata)],by="x") #merge strata values  
don_strata=don[,.SD[sample(.N,strata)],by=.(y,x)]
J. Chomel
  • 8,193
  • 15
  • 41
  • 69