data.table in R: creating variables from x into i

Question

I have two data tables,

a <- data.table(id=c(1,2,1,2,1,2), time=as.POSIXct(c("2018-01-01 01:10:00","2018-01-01 01:10:00","2018-01-01 01:11:00","2018-01-01 01:11:00","2018-01-01 01:12:00","2018-01-01 01:12:00")), beg=as.POSIXct(c("2018-01-01 01:00:00","2018-01-01 01:05:00","2018-01-01 01:00:00","2018-01-01 01:05:00","2018-01-01 01:01:00","2018-01-01 01:05:00")), end=as.POSIXct(c("2018-01-01 02:00:00","2018-01-01 02:05:00","2018-01-01 02:00:00","2018-01-01 02:05:00","2018-01-01 02:00:00","2018-01-01 02:05:00")))

> a
   id                time                 beg                 end
1:  1 2018-01-01 01:10:00 2018-01-01 01:00:00 2018-01-01 02:00:00
2:  2 2018-01-01 01:10:00 2018-01-01 01:05:00 2018-01-01 02:05:00
3:  1 2018-01-01 01:11:00 2018-01-01 01:00:00 2018-01-01 02:00:00
4:  2 2018-01-01 01:11:00 2018-01-01 01:05:00 2018-01-01 02:05:00
5:  1 2018-01-01 01:12:00 2018-01-01 01:01:00 2018-01-01 02:00:00
6:  2 2018-01-01 01:12:00 2018-01-01 01:05:00 2018-01-01 02:05:00

which has 650m lines by 4 columns, and

b <- data.table(id=c(1,2), abeg=as.POSIXct(c("2018-01-01 01:10:00","2018-01-01 01:11:00")), aend=as.POSIXct(c("2018-01-01 01:11:00","2018-01-01 01:12:00")))

> b
   id                abeg                aend
1:  1 2018-01-01 01:10:00 2018-01-01 01:11:00
2:  2 2018-01-01 01:11:00 2018-01-01 01:12:00

which has about 13m lines by 7 columns.

I would like to join b into a but keep all lines and columns of a. I understand that this is a left-join and would execute it as

b[a, .(id=i.id, time=i.time, beg=i.beg, end=i.end, abeg=x.abeg, aend=x.aend), on=.(id=id, abeg<=time, aend>=time)]

to obtain

   id                time                 beg                 end                abeg                aend
1:  1 2018-01-01 01:10:00 2018-01-01 01:00:00 2018-01-01 02:00:00 2018-01-01 01:10:00 2018-01-01 01:11:00
2:  2 2018-01-01 01:10:00 2018-01-01 01:05:00 2018-01-01 02:05:00                <NA>                <NA>
3:  1 2018-01-01 01:11:00 2018-01-01 01:00:00 2018-01-01 02:00:00 2018-01-01 01:10:00 2018-01-01 01:11:00
4:  2 2018-01-01 01:11:00 2018-01-01 01:05:00 2018-01-01 02:05:00 2018-01-01 01:11:00 2018-01-01 01:12:00
5:  1 2018-01-01 01:12:00 2018-01-01 01:01:00 2018-01-01 02:00:00                <NA>                <NA>
6:  2 2018-01-01 01:12:00 2018-01-01 01:05:00 2018-01-01 02:05:00 2018-01-01 01:11:00 2018-01-01 01:12:00

However, executing this on a Mac takes longer than 7 hours, when I had to abort. I joined on a 50m-rows-subset of a and this took about 8 minutes. I would like to avoid a loop over subsets, so I wonder whether I can make it more efficient.

For example, I suspect the assign command := can be used somehow. In data.table join then add columns to existing data.frame without re-copy it is explained how this can be done when all variables in b are kept and amended by variables from a. However, I seem to have the reverse case: I want to keep all columns in a and amend it by columns from b.

"my sense is this is not efficient" Why are you relying on your "sense"? Benchmark it. — Roland, Dec 05 '18 at 10:09
But wouldn`t I first need to know an alterntive to be able to compare? I rephrased now. — bumblebee, Dec 05 '18 at 10:17
Well, time your solution and compare it to what you need. No reason to optimize something that is fast enough. If it is not "fast enough" include your requirements in the question. — Roland, Dec 05 '18 at 10:21
Okay, I added some detail. Thanks -- I am rather new to this! — bumblebee, Dec 05 '18 at 10:25

score 0 · Answer 1 · answered Dec 06 '18 at 08:12

Here's an join with update by reference which I think does what you intend to do:

a[b, on=.(id=id, time>=abeg, time<=aend), `:=`(abeg = i.abeg, aend = i.aend)]

The resulting a is then:

   id                time                 beg                 end                abeg                aend
1:  1 2018-01-01 01:10:00 2018-01-01 01:00:00 2018-01-01 02:00:00 2018-01-01 01:10:00 2018-01-01 01:11:00
2:  2 2018-01-01 01:10:00 2018-01-01 01:05:00 2018-01-01 02:05:00                <NA>                <NA>
3:  1 2018-01-01 01:11:00 2018-01-01 01:00:00 2018-01-01 02:00:00 2018-01-01 01:10:00 2018-01-01 01:11:00
4:  2 2018-01-01 01:11:00 2018-01-01 01:05:00 2018-01-01 02:05:00 2018-01-01 01:11:00 2018-01-01 01:12:00
5:  1 2018-01-01 01:12:00 2018-01-01 01:01:00 2018-01-01 02:00:00                <NA>                <NA>
6:  2 2018-01-01 01:12:00 2018-01-01 01:05:00 2018-01-01 02:05:00 2018-01-01 01:11:00 2018-01-01 01:12:00

data.table in R: creating variables from x into i

1 Answers1