
I have two data.tables; I'd like to assign an element of one to the other at random from among those that match keys. The way I'm doing so right now is quite slow.

Let's get specific; here's some sample data:

dt1<-data.table(id=sample(letters[1:5],500,replace=T),var1=rnorm(500),key="id")
dt2<-data.table(id=c(rep("a",4),rep("b",8),rep("c",2),rep("d",5),rep("e",7)),
                place=paste(sample(c("Park","Pool","Rec Center","Library"),
                                   26,replace=T),
                            sample(26)),key="id")

I want to add two randomly chosen places to dt1 for each observation, but the places have to match on id.

Here's what I'm doing now:

get_place<-function(xx) sapply(xx,function(x) dt2[.(x),sample(place,1)])

dt1[,paste0("place",1:2):=list(get_place(id),get_place(id))]

This works, but it's quite slow: it took 66 seconds to run on my computer, basically an eon.

One issue is that I can't seem to take proper advantage of keying:

Something like dt2[.(dt1$id),mult="random"] would be perfect, but it doesn't appear to be possible.

Any suggestions?

MichaelChirico

3 Answers


A simple answer

dt2[.(dt1),as.list(c(
  place=sample(place,size=2,replace=TRUE)
)),by=.EACHI,allow.cartesian=TRUE]

This approach is simple and illustrates data.table features like Cartesian joins and by=.EACHI, but it is very slow, because for each row of dt1 it (i) calls sample and (ii) coerces the result to a list.
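
As an aside, the resulting columns come out named place1 and place2 thanks to c()'s name-indexing rule; a minimal standalone illustration:

# c() appends indices when one name spans a length-2 vector,
# so as.list() yields elements named place1 and place2
as.list(c(place = c("Park", "Pool")))
## $place1
## [1] "Park"
##
## $place2
## [1] "Pool"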

A faster answer

nsamp <- 2
# dt3: one row per id, with i0 = (first row of that id in dt2) - 1
# and N = the number of places dt2 offers for that id
dt3   <- dt2[.(unique(dt1$id)),list(i0=.I[1]-1L,.N),by=.EACHI]
# i0+sample(N,.N,replace=TRUE) draws .N random dt2 row numbers from the
# current id's block (dt2 is keyed, so each id's rows are contiguous)
dt1[.(dt3),paste0("place",1:nsamp):=
  replicate(nsamp,dt2$place[i0+sample(N,.N,replace=TRUE)],simplify=FALSE)
,by=.EACHI]

Using replicate with simplify=FALSE (as also in @bgoldst's answer) makes the most sense:

  • It returns a list of vectors, which is the format data.table requires when making new columns (see the sketch below).
  • replicate is the standard R function for repeated simulations.
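
A tiny sketch of that shape, independent of the tables above:

# replicate(..., simplify=FALSE) returns a plain list with one vector
# per replication: exactly what := expects for assigning two columns
replicate(2, sample(letters[1:4], 3, replace=TRUE), simplify=FALSE)
## a list of two length-3 character vectors (contents vary with the RNG)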

Benchmarks. We should vary several parameters and avoid modifying dt1 as we go along:

library(data.table)      # for the data.table syntax used throughout
library(microbenchmark)  # for the timings below

# candidate functions
frank2 <- function(){
  dt3   <- dt2[.(unique(dt1$id)),list(i0=.I[1]-1L,.N),by=.EACHI]
  dt1[.(dt3),
    replicate(nsamp,dt2$place[i0+sample(N,.N,replace=TRUE)],simplify=FALSE)
  ,by=.EACHI]
}
david2 <- function(){
  indx <- dt1[,.N, id]
  sim <- dt2[.(indx),
    replicate(2,sample(place,size=N,replace=TRUE),simplify=FALSE)
  ,by=.EACHI]
  dt1[, sim[,-1,with=FALSE]]
}
bgoldst<-function(){
  dt1[,
    replicate(2,ave(id,id,FUN=function(x) 
      sample(dt2$place[dt2$id==x[1]],length(x),replace=T)),simplify=F)
  ]
}

# simulation parameters (nsamp is used by frank2)
nsamp <- 2
size <- 1e6
nids <- 1e3
npls <- 2:15

dt1 <- data.table(id=sample(1:nids,size=size,replace=TRUE),var1=rnorm(size),key="id")
dt2 <- unique(dt1)[,list(place=sample(letters,sample(npls,1),replace=TRUE)),by=id]

# benchmarking
res <- microbenchmark(frank2(),david2(),bgoldst(),times=10)
print(res,order="cld",unit="relative")

which gives

Unit: relative
      expr      min       lq     mean   median       uq      max neval cld
 bgoldst() 8.246783 8.280276 7.090995 7.142832 6.579406 5.692655    10   b
  frank2() 1.042862 1.107311 1.074722 1.152977 1.092632 0.931651    10  a 
  david2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10  a 

And if we switch around the parameters...

# new simulation
size <- 1e4
nids <- 10
npls <- 1e6:2e6

dt1 <- data.table(id=sample(1:nids,size=size,replace=TRUE),var1=rnorm(size),key="id")
dt2 <- unique(dt1)[,list(place=sample(letters,sample(npls,1),replace=TRUE)),by=id]

# new benchmarking
res <- microbenchmark(frank2(),david2(),times=10)
print(res,order="cld",unit="relative")

we see

Unit: relative
     expr    min     lq     mean   median       uq     max neval cld
 david2() 3.3008 3.2842 3.274905 3.286772 3.280362 3.10868    10   b
 frank2() 1.0000 1.0000 1.000000 1.000000 1.000000 1.00000    10  a 

As one might expect, which way is faster (collapsing dt1 in david2 or collapsing dt2 in frank2) depends on how much information is compressed by collapsing.

Frank
  • I was sandboxing with `allow.cartesian` but have never actually used `by=.EACHI`, great example! – MichaelChirico May 20 '15 at 19:50
  • The only thing missing here is probably assignment by reference. – David Arenburg May 20 '15 at 19:53
  • Thanks :) Fyi, the syntax here may be nice, but this is probably not very efficient, since (i) it does not vectorize the call to `sample` as in David's answer and (ii) it's coercing to list...though I don't know how to get around that. – Frank May 20 '15 at 19:54
  • Hmm.. Yes, I was wondering if `size = 2` is just sampling twice from *each i* here (hence the name). – David Arenburg May 20 '15 at 20:01
  • @akrun Thanks. I've overhauled and cleaned it up. Mine wins, but I would have reported the benchmark anyways, as I have done over here, where Josh's is by far the best answer, votes and checkmark notwithstanding: http://stackoverflow.com/a/30171117/1191259 – Frank May 20 '15 at 21:43
  • This is a nice solution. What's the purpose of `i0` here? – David Arenburg May 21 '15 at 10:22
  • Btw, your approach is better when there are a small number of columns to create; it will scale worse as the number increases. – David Arenburg May 21 '15 at 10:36
  • @DavidArenburg Oh yeah, I forgot to vary that dimension; it's not obvious to me why one would be better than the other on it, though. The `i0` and `N` are a succinct description of the indices of the `dt2$place` associated with a given `dt2$id`: `i0` is the first index minus 1, and the full vector of the indices is `i0+1:N`. This works because `dt2` is keyed/sorted. Your solution doesn't need to deal with index shenanigans since the stuff in `sim` lines up with the rows of `dt1` (thanks to both `dt1` and `dt2` being keyed). There might be a better way I could've done it in mine. Dunno. – Frank May 21 '15 at 14:15
  • I tried running your solution without the `i0` part and it works fine as far as I can tell. But your solution is very fast. I've tried another approach using `set` and some others, but yours beats them. – David Arenburg May 21 '15 at 14:18
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/78434/discussion-between-frank-and-david-arenburg). – Frank May 21 '15 at 14:24

The perfect function for this purpose is ave(), since it allows running a function for each group of a vector, and automatically maps the return value back to the elements of the group:

set.seed(1);
dt1 <- data.table(id=sample(letters[1:5],500,replace=T), var1=rnorm(500), key='id' );
dt2 <- data.table(id=c(rep('a',4),rep('b',8),rep('c',2),rep('d',5),rep('e',7)), place=paste(sample(c('Park','Pool','Rec Center','Library'),26,replace=T), sample(26) ), key='id' );
dt1[,paste0('place',1:2):=replicate(2,ave(id,id,FUN=function(x) sample(dt2$place[dt2$id==x[1]],length(x),replace=T)),simplify=FALSE)]
dt1;
##      id       var1        place1        place2
##   1:  a -0.4252677 Rec Center 23       Park 12
##   2:  a -0.3892372       Park 12    Library 22
##   3:  a  2.6491669       Park 14 Rec Center 23
##   4:  a -2.2891240 Rec Center 23       Park 14
##   5:  a -0.7012317    Library 22       Park 12
##  ---
## 496:  e -1.0624084    Library 16    Library 16
## 497:  e -0.9838209     Library 4    Library 26
## 498:  e  1.1948510    Library 26       Pool 21
## 499:  e -1.3353714       Pool 18    Library 26
## 500:  e  1.8017255       Park 20       Pool 21

This should work with data.frames as well as data.tables.
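
A minimal standalone illustration of that mapping (toy vectors, not the question's data):

# ave() applies FUN within each group and returns a vector aligned
# with its input: one result element per original element
ave(c(10, 20, 30, 40), c("a", "b", "a", "b"), FUN = mean)
## [1] 20 30 20 30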


Edit: Adding benchmarking

This solution seems fastest, at least after making the correction suggested by Frank in the comments below.

frank<-function(){dt2[.(dt1),as.list(c(
  place=sample(place,size=2,replace=TRUE))),
  by=.EACHI,allow.cartesian=TRUE]}
david<-function(){
  # get_place() as defined in @DavidArenburg's answer below
  dt1[,paste0("place",1:2):=
        lapply(1:2,function(x) get_place(id,.N)),by=id]}
bgoldst<-function(){dt1[,paste0("place",1:2):=
                          replicate(2,ave(id,id,FUN=function(x) 
                            sample(dt2$place[dt2$id==x[1]],length(x),replace=T)),
                                    simplify=F)]}

microbenchmark(times=1000L,frank(),david(),bgoldst())

Unit: milliseconds
      expr      min       lq     mean   median       uq      max neval cld
   frank() 5.125843 5.353918 6.276879 5.496042 5.772051 15.57155  1000  b 
   david() 6.049172 6.305768 7.172360 6.455687 6.669202 93.06398  1000   c
 bgoldst() 1.421330 1.521046 1.847821 1.570573 1.628424 89.60315  1000 a  
bgoldst
  • Your benchmark is incorrect. You should test on a bigger data set; instead of 500, try testing on `1e5` or bigger and then see which is faster. – David Arenburg May 20 '15 at 20:36
  • Fyi, in my benchmarking, I also looked at a variant of `bgoldst()` with `sample(dt2[.(x[1])]$place,length(x),replace=T)` in place of `sample(dt2$place[dt2$id==x[1]],length(x),replace=T)` and found it to be ~15x as fast. – Frank May 20 '15 at 22:34

When you run sapply over each row, you are basically not using any data.table capabilities. Instead, you can combine a binary join with the by parameter, sampling only once per id. You could define get_place as follows:

get_place <- function(tempid, N) dt2[.(tempid), sample(place, N, replace = TRUE)]

Then simply do

dt1[, place1 := get_place(id, .N), by = id]

Or a general solution would be

indx <- 1:2
dt1[, paste0("place", indx) := lapply(indx, function(x) get_place(id, .N)), by = id]
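
As a quick sanity check (a sketch using the columns just created), one can verify that every sampled place matches its row's id:

# each id's sampled place1 values should all be places dt2 lists for that id
chk <- dt1[, all(place1 %in% dt2[.(id[1]), place]), by = id]
all(chk$V1) # should be TRUE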

Here's a benchmark on a bit bigger dt1

size = 1e6
set.seed(123)
dt1 <- data.table(id=sample(letters[1:5],size,replace=TRUE),var1=rnorm(size),key="id")

Using the same functions as defined in @bgoldst's answer:

microbenchmark(times = 10L, frank(), david(), bgoldst())
# Unit: milliseconds
# expr              min         lq       mean     median         uq        max neval
# frank()   11627.68324 11771.4227 11887.1232 11804.6342 12012.4636 12238.1031    10
# david()      84.62109   122.1117   121.1003   123.5861   128.0042   132.3591    10
# bgoldst()   372.02267   400.8867   445.6231   421.3168   445.9076   709.5458    10

Here is another, faster variant on the same idea (as seen in @Frank's benchmark):

indx <- dt1[,.N, id]   # number of rows per id in dt1
# a single join to dt2: for each id, draw N samples, twice (columns V1, V2)
sim <- dt2[.(indx),replicate(2,sample(place,size=N,replace=TRUE),simplify=FALSE),by=.EACHI]
# `[.listof`(sim,-1) is plain list subsetting: it drops the first (id)
# column of sim, leaving just the two sampled columns to assign
dt1[,paste0("place",1:2):=`[.listof`(sim,-1)]
David Arenburg
  • This is perfect. So simple; I guess I need more sleep! – MichaelChirico May 20 '15 at 19:33
  • Just curious: does that use the new version of my answer that I appended just recently? – Frank May 20 '15 at 20:43
  • @Frank, no, I didn't see it until now – David Arenburg May 20 '15 at 20:43
  • @Frank I'm actually surprised that the `ave`/`replicate` combination is so efficient, as they are both just loops combined with `split`. I'm sure it's just late and I'm missing something obvious here. – David Arenburg May 20 '15 at 20:57
  • I'm not so surprised that `ave`/`replicate` is good relative to data.table's `by`, as `ave` has a much simpler job and so should have less overhead. I am more surprised that `dt2$place[dt2$id==x[1]]` isn't awful as I scale the simulation up -- seems obvious that one should use a join instead there. Anyways, now that I've switched the order from `dt2[myrows]$place` to `dt2$place[myrows]`, I'm seeing it as the fastest by quite a distance. – Frank May 20 '15 at 21:13
  • 1
    And similarly, `replicate` should be faster than `lapply` with a function that ignores its argument, since `replicate` is designed specifically for that task, I guess. – Frank May 20 '15 at 21:20
  • @Frank, not in front of a computer right now (in bed actually), but I'm starting to think that there is no need for joins at all. One could do `indx <- dt1[, .N, id]`, then join this to `dt2` only once while assigning `N` per id. Then you can sample N values per id using only `dt2` as many times as you want, without the need to join to `dt1` each time. – David Arenburg May 20 '15 at 21:42
  • For my example, the approach you mention wins. I don't know where else to put it, so I'll edit it into your answer. Please edit as you see fit. – Frank May 20 '15 at 21:59