How to perform join over date ranges using data.table?

Question

How to do the below (straightforward using sqldf) using data.table and get exact same result:

library(data.table)

whatWasMeasured <- data.table(start=as.POSIXct(seq(1, 1000, 100),
    origin="1970-01-01 00:00:00"),
    end=as.POSIXct(seq(10, 1000, 100), origin="1970-01-01 00:00:00"),
    x=1:10,
    y=letters[1:10])

measurments <- data.table(time=as.POSIXct(seq(1, 2000, 1),
    origin="1970-01-01 00:00:00"),
    temp=runif(2000, 10, 100))

## Alternative short names for data.tables
dt1 <- whatWasMeasured
dt2 <- measurments

## Straightforward with sqldf    
library(sqldf)

sqldf("select * from measurments m, whatWasMeasured wwm
where m.time between wwm.start and wwm.end")

Does this solve your problem? http://stackoverflow.com/questions/5123197/matching-time-a-time-in-the-interval-between-a-start-and-end-time — David Robinson, Dec 15 '14 at 15:58
@DavidRobinson Thanks. I have actually seen this question and answer, but unfortunately my data volume is huge and would really prefer a fast data.table solution... — Samo, Dec 15 '14 at 16:00
I know this is a bit unrealted, but am I the only one getting Dec-31-1969 dates? Should it not be jan-1-1970? — Serban Tanasa, Dec 15 '14 at 16:55

score 27 · Accepted Answer · edited Jun 20 '20 at 09:12

27

You can use the foverlaps() function which implements joins over intervals efficiently. In your case, we just need a dummy column for measurments.

Note 1: You should install the development version of data.table - v1.9.5 as a bug with foverlaps() has been fixed there. You can find the installation instructions here.

Note 2: I'll call whatWasMeasured = dt1 and measurments = dt2 here for convenience.

require(data.table) ## 1.9.5+
dt2[, dummy := time]

setkey(dt1, start, end)
ans = foverlaps(dt2, dt1, by.x=c("time", "dummy"), nomatch=0L)[, dummy := NULL]

See ?foverlaps for more info and this post for a performance comparison.

edited Jun 20 '20 at 09:12

Community

1
1

answered Dec 15 '14 at 16:06

Arun

116,683
26
284
387

is there any way we could get away with some sort of %between% statement? Had a hard time getting dt2[time %between% dt1[, list(start,end)]] to work as i was hoping. – Serban Tanasa Dec 15 '14 at 17:29
@SerbanTanasa, have a look at [the post](http://stackoverflow.com/a/25655497/559784) I've linked in my answer. It compares the performance of a solution using `between` as well. – Arun Dec 15 '14 at 17:31
Thanks for your answer. Great. It works. Very fast. The only minor issue is that it does not like missing values for start or end or time. maybe a na.rm handling would be nice to have. Also, when doing this exercise on real data I get in the end result two additional unwanted columns named "start" and "end" which I neither have in input data nor are the names of the keys or anything... – Samo Dec 15 '14 at 17:31
1

@Samo, use `na.omit()` along with `cols` argument (if you're using 1.9.5). Will think about `na.rm` argument, thanks. I don't understand - your data.table `whatWasMeasured` has two columns `start` and `end`, and the sqldf output also has the same columns on the example from your post. – Arun Dec 15 '14 at 17:33
@Arun Thanks. And yes, please ignore my comment about phantom start and and appearing unexpectedly. Actually they do not appear. Thanks for understanding. I was testing this on "real" data not the toy example above and obviously I did not refresh all the objects... – Samo Dec 15 '14 at 17:42

How to perform join over date ranges using data.table?

1 Answers1

Linked

Related