A simple answer
dt2[.(dt1),as.list(c(
place=sample(place,size=2,replace=TRUE)
)),by=.EACHI,allow.cartesian=TRUE]
This approach is simple and illustrates data.table features like Cartesian joins and by=.EACHI, but it is very slow because, for each row of dt1, it (i) samples and (ii) coerces the result to a list.
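To make the cost concrete, here is a rough base-R sketch of the two calling patterns, not the data.table code itself. The vector `place` and the size `n_rows` are made up for illustration; the point is only how many times sample() and as.list() get called.

```r
set.seed(1)
place  <- c("a", "b", "c")  # hypothetical places available for one id
n_rows <- 1000              # hypothetical number of dt1 rows sharing that id

# per-row pattern: one sample() call and one list coercion for every row
per_row <- lapply(seq_len(n_rows), function(i) {
  as.list(sample(place, size = 2, replace = TRUE))
})

# per-group pattern: two vectorised sample() calls cover all rows at once
per_group <- replicate(2, sample(place, size = n_rows, replace = TRUE),
                       simplify = FALSE)

length(per_row)    # one small list per row
length(per_group)  # one long vector per new column
```

Wrapping each pattern in system.time() with a larger n_rows should show the gap, though the exact ratio will depend on the machine.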
A faster answer
nsamp <- 2
dt3 <- dt2[.(unique(dt1$id)),list(i0=.I[1]-1L,.N),by=.EACHI]
dt1[.(dt3),paste0("place",1:nsamp):=
replicate(nsamp,dt2$place[i0+sample(N,.N,replace=TRUE)],simplify=FALSE)
,by=.EACHI]
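The index arithmetic above can be hard to read inside the join. As I understand it, i0 is the position just before an id's block of rows in dt2 and N is the size of that block, so i0 + sample(N, ...) picks row positions inside the block. A minimal base-R sketch of that idea, assuming a lookup sorted by id as a keyed dt2 would be (the vectors `id` and `place` and the six draws are invented for the example):

```r
set.seed(42)
# toy lookup, sorted by id
id    <- c(1, 1, 1, 2, 2)
place <- c("a", "b", "c", "x", "y")

# for id == 2: locate its contiguous block of rows
rows <- which(id == 2)
i0   <- rows[1] - 1L   # position just before the block
N    <- length(rows)   # number of rows in the block

# draw 6 places for id 2: sample positions 1..N, then shift by i0
draws <- place[i0 + sample(N, 6, replace = TRUE)]
draws
```

Sampling positions rather than values means the place column is never copied per group; only integer indices are generated.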
Using replicate with simplify=FALSE (as also in @bgoldst's answer) makes the most sense:
- It returns a list of vectors, which is the format data.table requires when making new columns.
- replicate is the standard R function for repeated simulations.
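A small base-R illustration of the first point, with a made-up `places` vector: with simplify=FALSE the result is a list holding one vector per replication, while the default collapses equal-length results into a matrix.

```r
set.seed(1)
places <- c("a", "b", "c")  # hypothetical values to sample from

# simplify = FALSE: a list with one length-4 vector per replication
as_list <- replicate(2, sample(places, size = 4, replace = TRUE),
                     simplify = FALSE)

# default simplify = TRUE: fresh draws collapsed into a 4 x 2 character matrix
as_matrix <- replicate(2, sample(places, size = 4, replace = TRUE))

str(as_list)
dim(as_matrix)
```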
Benchmarks. We should vary several features of the data and avoid modifying dt1 as we go along, so the candidate functions return the sampled columns instead of assigning them with := :
# packages used below
library(data.table)
library(microbenchmark)
# candidate functions
frank2 <- function(){
  dt3 <- dt2[.(unique(dt1$id)),list(i0=.I[1]-1L,.N),by=.EACHI]
  dt1[.(dt3),
    replicate(nsamp,dt2$place[i0+sample(N,.N,replace=TRUE)],simplify=FALSE)
  ,by=.EACHI]
}
david2 <- function(){
  indx <- dt1[,.N, id]  # number of dt1 rows per id
  sim <- dt2[.(indx),
    replicate(nsamp,sample(place,size=N,replace=TRUE),simplify=FALSE)
  ,by=.EACHI]
  dt1[, sim[,-1,with=FALSE]]  # drop the first (id) column
}
bgoldst <- function(){
  dt1[,
    replicate(nsamp,ave(id,id,FUN=function(x)
      sample(dt2$place[dt2$id==x[1]],length(x),replace=TRUE)),simplify=FALSE)
  ]
}
# simulation
size <- 1e6
nids <- 1e3
npls <- 2:15
dt1 <- data.table(id=sample(1:nids,size=size,replace=TRUE),var1=rnorm(size),key="id")
dt2 <- unique(dt1)[,list(place=sample(letters,sample(npls,1),replace=TRUE)),by=id]
# benchmarking
res <- microbenchmark(frank2(),david2(),bgoldst(),times=10)
print(res,order="cld",unit="relative")
which gives
Unit: relative
expr min lq mean median uq max neval cld
bgoldst() 8.246783 8.280276 7.090995 7.142832 6.579406 5.692655 10 b
frank2() 1.042862 1.107311 1.074722 1.152977 1.092632 0.931651 10 a
david2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10 a
And if we switch around the parameters...
# new simulation
size <- 1e4
nids <- 10
npls <- 1e6:2e6
dt1 <- data.table(id=sample(1:nids,size=size,replace=TRUE),var1=rnorm(size),key="id")
dt2 <- unique(dt1)[,list(place=sample(letters,sample(npls,1),replace=TRUE)),by=id]
# new benchmarking
res <- microbenchmark(frank2(),david2(),times=10)
print(res,order="cld",unit="relative")
we see
Unit: relative
expr min lq mean median uq max neval cld
david2() 3.3008 3.2842 3.274905 3.286772 3.280362 3.10868 10 b
frank2() 1.0000 1.0000 1.000000 1.000000 1.000000 1.00000 10 a
As one might expect, which way is faster (collapsing dt1 in david2 or collapsing dt2 in frank2) depends on how much information is compressed by collapsing.