R bootstrap statistics by group for big data

Question

I want to bootstrap a data set that has groups in it. A simple scenario would be bootstrapping simple means:

data <- as.data.table(list(x1 = runif(200), x2 = runif(200), group = runif(200)>0.5))
stat <- function(x, i) {x[i, c(m1 = mean(x1), m2 = mean(x2)), by = "group"]}
boot(data, stat, R = 10)

This gives me the error incorrect number of subscripts on matrix, because of by = "group" part. I managed to solve it using subsetting, but don't like this solution. Is there simpler way to make this kind of task work?

In particular, I'd like to introduce an additional argument in the statistics function like stat(x, i, groupvar) and pass it to the boot function like boot(data, stat(groupvar = group), R = 100)?

Do you want to do stratified resampling? There is a `strata` argument in `boot`. — Roland, Sep 20 '13 at 09:39
no, I just want to have separate statistics' values for each group. If I understand it correctly, strata argument ensures that your statistics would NOT be calculated using data from one group, doesn't it? Result value with strata would be one-dimensional, instead, I want n result statistics where n is the number of groups — RInatM, Sep 20 '13 at 10:44

score 3 · Answer 1 · answered Apr 20 '16 at 00:38

Using

 boot       * 1.3-18  2016-02-23 CRAN (R 3.2.3)                        
 data.table * 1.9.7   2015-10-05 Github (Rdatatable/data.table@d607425)

I received an error using the OP's code with the answer supplied by @eddi:

data <- as.data.table(list(x1 = runif(200), x2 = runif(200), group = runif(200)>0.5))
stat <- function(x, i) {x[i, c(m1 = mean(x1), m2 = mean(x2)), by = "group"]}
data[, list(list(boot(.SD, stat, R = 10))), by = group]$V1

Produces the error message:

Error in eval(expr, envir, enclos) : object 'group' not found

The error is fixed by removing by=group from the function stat:

set.seed(1000)
data <- as.data.table(list(x1 = runif(200), x2 = runif(200), group = runif(200)>0.5))
stat <- function(x, i) {x[i, c(m1 = mean(x1), m2 = mean(x2))]}
data[, list(list(boot(.SD, stat, R = 10))), by = group]$V1

Which produces the following Bootstrap Statistics results:

[[1]]

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = .SD, statistic = stat, R = 10)


Bootstrap Statistics :
     original       bias    std. error
t1* 0.5158232  0.004930451  0.01576641
t2* 0.5240713 -0.001851889  0.02851483

[[2]]

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = .SD, statistic = stat, R = 10)


Bootstrap Statistics :
     original        bias    std. error
t1* 0.5142383 -0.0072475030  0.02568692
t2* 0.5291694 -0.0001509404  0.02378447

Below, I modify the sample dataset to highlight which Bootstrap Statistic goes with which group-column combination:

Consider group 1 which has a mean value of 10 for x1 and a mean value of 10000 for x2 and group 2 which has a mean value of 2000 for x1 and a mean value of 8000 for x2:

data2 <- as.data.table(list(x1 = c(runif(100, 9,11),runif(100, 1999,2001)), x2 = c(runif(100, 9999,10001),runif(100, 7999,8001)), group = rep(c(1,2), each=100)))
stat <- function(x, i) {x[i, c(m1 = mean(x1), m2 = mean(x2))]}
data2[, list(list(boot(.SD, stat, R = 10))), by = group]$V1

Which gives:

[[1]]

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = .SD, statistic = stat, R = 10)


Bootstrap Statistics :
      original       bias    std. error
t1*   10.00907  0.007115938  0.04349184
t2* 9999.90176 -0.019569568  0.06160653

[[2]]

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = .SD, statistic = stat, R = 10)


Bootstrap Statistics :
    original       bias    std. error
t1* 1999.965  0.031694179  0.06561209
t2* 8000.110 -0.006569872  0.03992401

score 2 · Accepted Answer · answered Sep 20 '13 at 14:03

2

This should do it:

data[, list(list(boot(.SD, stat, R = 10))), by = group]$V1

answered Sep 20 '13 at 14:03

eddi

49,088
6
104
155

Nicely done. I was trying to do it with `data.table`'s `by` argument, but couldn't figure it out. Care to explain why the `list(list(` is necessary? – Ari B. Friedman Sep 20 '13 at 16:28
1

@AriB.Friedman you need the inner `list` to tell `data.table` that you're storing a `list` in the column (the outer `list`); maybe `data[, list(newcol = list(boot(...` would make that more clear – eddi Sep 20 '13 at 16:40
1

@eddi I found your answer required removing `by = group` from the OP's supplied stat function... please see my answer for more details. – swihart Apr 20 '16 at 00:44

score 1 · Answer 3 · edited May 23 '17 at 12:16

1

Lots of problems in your code before you even get to the by group part.

Did you mean something like this?

data <- as.data.frame(list(x1 = runif(200), x2 = runif(200), group = factor(sample(letters[1:2]))))
stat <- function(x, i)  c(m1 = mean(x$x1[i]), m2 = mean(x$x2[i]))

> stat(x,1:10)
       m1        m2 
0.4465738 0.5522221

Then from there you can worry about doing it by group however you choose to.

For instance:

library(plyr)
dlply( data, .(group), function( dat ) boot(dat, stat, R=10) )

For bigger datasets, try data.table:

by( seq(nrow(data)), data$group, function(idx) myboot(data[idx,]))

I went with by() rather than the data.table's ,by= argument because you want the output to be a list. There may be some functionality I don't know about for doing that, but I couldn't find it (see the edit history for the problem it was causing).

The subsetting is still done via the data.table's [] method, so it should be plenty fast.

edited May 23 '17 at 12:16

Community

1
1

answered Sep 20 '13 at 11:00

Ari B. Friedman

71,271
35
175
235

thanks, this does what I want. But the link you posted recommends data.table package for subsetting and grouping large datasets (in my case - 800 thousands rows with 220 groups). I still struggle to apply data.table subsetting for your solution – RInatM Sep 20 '13 at 12:19
@RInatM You didn't mention big data ;-). I'll see what I can do, although I'm no data.table expert. – Ari B. Friedman Sep 20 '13 at 12:32
@AriB.Friedman `data[idx,]` (where `idx` is row numbers) isn't any faster with `data.table` than it is with `data.frame`. – Matt Dowle Sep 20 '13 at 18:06
@MatthewDowle Is it faster to do with `.SD` as in eddi's answer? – Ari B. Friedman Sep 20 '13 at 18:28
@AriB.Friedman Yes I'd expect so. When we say `.SD` is slow, we mean slow_er_ than using column names directly in `j`. But `.SD` is a lot faster than `by()`. – Matt Dowle Sep 20 '13 at 18:47
@MatthewDowle Would using `.I` be faster than `.SD` in this case? – Ari B. Friedman Sep 21 '13 at 03:36
1

`data[.I]` would create new memory for each group (like `split` and `by` do) so would negate the benefits of `data.table` grouping. `.SD` is allocated once up front for the largest group and reused for each group. If all the columns of `data` are used, then `.SD` is recommended and as fast as it can be. – Matt Dowle Sep 21 '13 at 07:11

R bootstrap statistics by group for big data

3 Answers3

Linked