Bootstrapping multiple columns in data.table in a scalable fashion R

Question

This is a follow up question to this one. In the original question the OP wanted to perform bootstrap on two columns x1 and x2 that are fixed:

set.seed(1000)
data <- as.data.table(list(x1 = runif(200), x2 = runif(200), group = runif(200)>0.5))
stat <- function(x, i) {x[i, c(m1 = mean(x1), m2 = mean(x2))]}
data[, list(list(boot(.SD, stat, R = 10))), by = group]$V1

However, I think this problem can be nicely extended to handle any number of columns by treating them as groups. For instance, lets use the iris dataset. Say I want to calculate bootstrap mean for all four dimensions for each species. I can use melt to flip the data and then use the Species, variable combination to get the mean in one go - I think this approach will scale well.

data(iris)
iris = data.table(iris)
iris[,mean(Sepal.Length),by=Species]
iris[,ID:=.N,]
iris_deep = melt(iris
                 ,id.vars = c("ID","Species")
                 ,measure.vars = c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width"))
#define a mean bootstrap function
stat <- function(x, i) {x[i, m=mean(value),]}
iris_deep[, list(list(boot(.SD, stat, R = 100))), by = list(Species,variable)]$V1

Here is my attempt at doing this. However the bootstrapping part does not seem to be working. As R throws the following error:

Error in mean(value) : object 'value' not found

Can someone please take a crack at this?

renato vitolo · Answer 1 · 2016-08-17T20:10:46.560

1

I tried this (with added braces enclosing m=mean(value)) and it appears to work:

stat <- function(x, i) {x[i, (m=mean(value))]}

edited Aug 17 '16 at 20:10

answered Aug 17 '16 at 08:24

renato vitolo

1,744
11
16

score 0 · Answer 2 · answered Sep 02 '20 at 22:06

We can utilize each bootstrap fully, and calculate the mean for each variable within each group, instead of rerunning the bootstrap for each variable.

So if we do something like this, it calculates the mean for each variable:

iris = data.table(iris)
iris[sample(nrow(iris),replace=TRUE),lapply(.SD,mean,na.rm=TRUE),by=Species]

Because boot requires a vector / matrix output, we need to modify the output above, and provide names for the vector:

d = function(dat,ind){
 k = dat[ind,lapply(.SD,mean,na.rm=TRUE),by=Species]
 k_vec = unlist(k[,-1])
 names(k_vec) = paste(rep(colnames(k)[-1],each=nrow(k)),rep(k$Species,(ncol(k)-1)),sep="_")
 k_vec
}

d(iris,sample(nrow(iris),replace=TRUE))
 Sepal.Length_versicolor  Sepal.Length_virginica     Sepal.Length_setosa 
              5.8784314               6.4851852               4.9688889 
 Sepal.Width_versicolor   Sepal.Width_virginica      Sepal.Width_setosa 
              2.7392157               2.9814815               3.3977778 
Petal.Length_versicolor  Petal.Length_virginica     Petal.Length_setosa 
              4.1980392               5.5037037               1.4644444 
 Petal.Width_versicolor   Petal.Width_virginica      Petal.Width_setosa 
              1.2960784               2.0944444               0.2333333

And use boot with strata = iris$Species to ensure the Species are sampled evenly:

bo_strata = boot(iris,d,R=1000,strata=iris$Species)

We can compare the distributions of this approach compared to that in the question:

stat <- function(x, i) {x[i, (m=mean(value))]}
bo_melt = iris_deep[, list(list(boot(.SD, stat, R = 1000))), by = list(Species,variable)]$V1

par(mfrow=c(4,3))
par(mar=c(3,3,3,3))
for(i in 1:ncol(bo_strata$t)){
plot(density(bo_strata$t[,i]),main=names(bo_strata$t0)[i],col="#43658b")
lines(density(bo_melt[[i]]$t),col="#ffa372")
legend("topright",fill=c("#43658b","#ffa372"),c("strata","other"))
}

Bootstrapping multiple columns in data.table in a scalable fashion R

2 Answers2

Linked