I have a small table with subject IDs and a big table of "events" for those subjects (i.e. each record contains an event date, a subject ID, and many other details). Something like this:
library(data.table)

set.seed(8)
n <- 20L
dt.subjects <- data.table(sbjID = c(sample(LETTERS, n/2), sample(letters, n/2)))

N <- 1e7L
dt.events <- data.table(
  sbjID   = sample(LETTERS, N, TRUE),
  relDate = sample(seq_len(1000), N, replace = TRUE)
)
dt.events[, evDate := as.Date('1990-01-01') + relDate]
# sbjID relDate evDate
# 1: L 975 1992-09-02
# 2: G 231 1990-08-20
# 3: H 379 1991-01-15
# 4: S 916 1992-07-05
# 5: F 619 1991-09-12
# 6: G 200 1990-07-20
Now I want to add a calculated field (say, the date of the last event for each subject) to the first table.
In the early days, when I was not yet very familiar with data.table syntax, I used merge():
dt.subjects <- merge(
  dt.subjects,
  dt.events[, .(lastDate = max(evDate)), by = sbjID],
  by    = 'sbjID',
  all.x = TRUE
)
It does the job, but it looks quite ugly and also causes trouble if a lastDate column is already there. Nowadays I use the following:
dt.subjects[,
  lastDate := dt.events[sbjID, max(evDate), on = 'sbjID'],
  by = sbjID]
I think this lastDate := form reads much more intuitively than merge().
But... it is much slower! (5 times slower with the dummy dataset given here, 20 times slower on my actual data.)
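A quick way to reproduce the comparison is to wrap each version in system.time() on a fresh copy of the subjects table (timings omitted here since they depend on the machine; subj and out.merge are just throwaway names for this sketch):

subj <- dt.subjects[, .(sbjID)]  # fresh copy with only the ID column

# version 1: aggregate, then merge()
system.time(
  out.merge <- merge(
    subj,
    dt.events[, .(lastDate = max(evDate)), by = sbjID],
    by = 'sbjID', all.x = TRUE
  )
)

# version 2: grouped := with a join per subject
system.time(
  subj[, lastDate := dt.events[sbjID, max(evDate), on = 'sbjID'], by = sbjID]
)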
Is there any solution that would be faster than my second one but less ugly than the first one?
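For example, would a one-shot "update join" (aggregate the big table once, then assign into dt.subjects by reference) be the idiomatic middle ground? A rough sketch of what I have in mind (not benchmarked):

dt.subjects[
  dt.events[, .(lastDate = max(evDate)), by = sbjID],  # aggregate the big table once
  on = 'sbjID',
  lastDate := i.lastDate                               # i. refers to the aggregated table
]

Subjects without any events should end up with lastDate = NA, just like with all.x = TRUE in the merge() version.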