I have two data tables,
a <- data.table(id=c(1,2,1,2,1,2), time=as.POSIXct(c("2018-01-01 01:10:00","2018-01-01 01:10:00","2018-01-01 01:11:00","2018-01-01 01:11:00","2018-01-01 01:12:00","2018-01-01 01:12:00")), beg=as.POSIXct(c("2018-01-01 01:00:00","2018-01-01 01:05:00","2018-01-01 01:00:00","2018-01-01 01:05:00","2018-01-01 01:01:00","2018-01-01 01:05:00")), end=as.POSIXct(c("2018-01-01 02:00:00","2018-01-01 02:05:00","2018-01-01 02:00:00","2018-01-01 02:05:00","2018-01-01 02:00:00","2018-01-01 02:05:00")))
> a
id time beg end
1: 1 2018-01-01 01:10:00 2018-01-01 01:00:00 2018-01-01 02:00:00
2: 2 2018-01-01 01:10:00 2018-01-01 01:05:00 2018-01-01 02:05:00
3: 1 2018-01-01 01:11:00 2018-01-01 01:00:00 2018-01-01 02:00:00
4: 2 2018-01-01 01:11:00 2018-01-01 01:05:00 2018-01-01 02:05:00
5: 1 2018-01-01 01:12:00 2018-01-01 01:01:00 2018-01-01 02:00:00
6: 2 2018-01-01 01:12:00 2018-01-01 01:05:00 2018-01-01 02:05:00
which has 650m lines by 4 columns, and
b <- data.table(id=c(1,2), abeg=as.POSIXct(c("2018-01-01 01:10:00","2018-01-01 01:11:00")), aend=as.POSIXct(c("2018-01-01 01:11:00","2018-01-01 01:12:00")))
> b
id abeg aend
1: 1 2018-01-01 01:10:00 2018-01-01 01:11:00
2: 2 2018-01-01 01:11:00 2018-01-01 01:12:00
which has about 13m lines by 7 columns.
I would like to join b
into a
but keep all lines and columns of a
. I understand that this is a left-join and would execute it as
b[a, .(id=i.id, time=i.time, beg=i.beg, end=i.end, abeg=x.abeg, aend=x.aend), on=.(id=id, abeg<=time, aend>=time)]
to obtain
id time beg end abeg aend
1: 1 2018-01-01 01:10:00 2018-01-01 01:00:00 2018-01-01 02:00:00 2018-01-01 01:10:00 2018-01-01 01:11:00
2: 2 2018-01-01 01:10:00 2018-01-01 01:05:00 2018-01-01 02:05:00 <NA> <NA>
3: 1 2018-01-01 01:11:00 2018-01-01 01:00:00 2018-01-01 02:00:00 2018-01-01 01:10:00 2018-01-01 01:11:00
4: 2 2018-01-01 01:11:00 2018-01-01 01:05:00 2018-01-01 02:05:00 2018-01-01 01:11:00 2018-01-01 01:12:00
5: 1 2018-01-01 01:12:00 2018-01-01 01:01:00 2018-01-01 02:00:00 <NA> <NA>
6: 2 2018-01-01 01:12:00 2018-01-01 01:05:00 2018-01-01 02:05:00 2018-01-01 01:11:00 2018-01-01 01:12:00
However, executing this on a Mac takes longer than 7 hours, when I had to abort. I joined on a 50m-rows-subset of a
and this took about 8 minutes. I would like to avoid a loop over subsets, so I wonder whether I can make it more efficient.
For example, I suspect the assign command :=
can be used somehow. In data.table join then add columns to existing data.frame without re-copy it is explained how this can be done when all variables in b
are kept and amended by variables from a
. However, I seem to have the reverse case: I want to keep all columns in a
and amend it by columns from b
.