I have a small table with subject IDs and a big table of "events" for those subjects (i.e. each record contains an event date, a subject ID, and many other details). Something like this:
library(data.table)

set.seed(8)
n <- 20L
dt.subjects <- data.table(sbjID = c(sample(LETTERS, n/2), sample(letters, n/2)))

N <- 1e7L
dt.events <- data.table(
  sbjID   = sample(LETTERS, N, TRUE),
  relDate = sample(seq_len(1000), N, replace = TRUE)
)
dt.events[, evDate := as.Date('1990-01-01') + relDate]
# sbjID relDate evDate
# 1: L 975 1992-09-02
# 2: G 231 1990-08-20
# 3: H 379 1991-01-15
# 4: S 916 1992-07-05
# 5: F 619 1991-09-12
# 6: G 200 1990-07-20
Now I want to add a calculated field (say, the date of the last event for each subject) to the first table.
In the early days, when I was not yet very familiar with data.table syntax, I used merge():
dt.subjects <- merge(
  dt.subjects,
  dt.events[, .(lastDate = max(evDate)), by = sbjID],
  by    = 'sbjID',
  all.x = TRUE
)
It does the job, but it looks quite ugly and also causes trouble if a lastDate column is already there. Nowadays I use the following:
dt.subjects[,
  lastDate := dt.events[sbjID, max(evDate), on = 'sbjID'],
  by = sbjID]
I think this lastDate := form reads much more intuitively than merge().
But... it is much slower! (5 times slower with the dummy dataset given here, 20 times slower on my actual data.)
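A quick way to reproduce the comparison is to wrap each version in system.time() on a fresh copy of the subjects table (timings omitted here since they depend on the machine; subj and out.merge are just throwaway names for this sketch):

subj <- dt.subjects[, .(sbjID)]  # fresh copy with only the ID column

# version 1: aggregate, then merge()
system.time(
  out.merge <- merge(
    subj,
    dt.events[, .(lastDate = max(evDate)), by = sbjID],
    by = 'sbjID', all.x = TRUE
  )
)

# version 2: grouped := with a join per subject
system.time(
  subj[, lastDate := dt.events[sbjID, max(evDate), on = 'sbjID'], by = sbjID]
)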
Is there any solution that would be faster than my second one but less ugly than the first one?
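For example, would a one-shot "update join" (aggregate the big table once, then assign into dt.subjects by reference) be the idiomatic middle ground? A rough sketch of what I have in mind (not benchmarked):

dt.subjects[
  dt.events[, .(lastDate = max(evDate)), by = sbjID],  # aggregate the big table once
  on = 'sbjID',
  lastDate := i.lastDate                               # i. refers to the aggregated table
]

Subjects without any events should end up with lastDate = NA, just like with all.x = TRUE in the merge() version.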