The OP has asked two questions:

- Is using the `%within%` operator from lubridate the best approach?
- Is it possible to automate it so that the OP does not have to hand-write all 60 intervals?
To answer the second question first: Yes, it is possible.

## `%within%`, `lapply()` and `interval()`
The OP was almost there. According to the documentation of `a %within% b`:

> If `b` is a list of intervals, `a` is checked if it falls within any of
> the intervals.
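The OP's `df1` and `df2` come from the question and are not repeated here. For a self-contained example, a minimal reconstruction consistent with the outputs below could look like this (the dates in `df2` follow from the interval endpoints printed below; the last three dates in `df1` are placeholders chosen only to fall outside every interval):

```r
# Reconstructed sample data (assumption): df2$event2 is derived from the
# interval endpoints shown below; the last three df1$event1 dates are
# placeholders that lie outside all intervals.
df1 <- data.frame(event1 = as.Date(c("1980-01-10", "1980-01-13", "1980-01-14",
                                     "1980-05-01", "1980-07-01", "1980-12-01")))
df2 <- data.frame(event2 = as.Date(c("1980-01-16", "1980-01-18", "1980-01-19",
                                     "1980-02-12", "1980-09-26", "1980-10-23")))
```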
We can create a list of intervals from the given date vector `df2$event2` by

```r
lapply(df2$event2, function(x) interval(x - 10, x))
```

```
[[1]]
[1] 1980-01-06 UTC--1980-01-16 UTC

[[2]]
[1] 1980-01-08 UTC--1980-01-18 UTC

[[3]]
[1] 1980-01-09 UTC--1980-01-19 UTC

[[4]]
[1] 1980-02-02 UTC--1980-02-12 UTC

[[5]]
[1] 1980-09-16 UTC--1980-09-26 UTC

[[6]]
[1] 1980-10-13 UTC--1980-10-23 UTC
```
The start date of each interval is computed from the end date by subtracting 10 days. So,

```r
library(lubridate)
df1$event1 %within% lapply(df2$event2, function(x) interval(x - 10, x))
```

returns a logical vector (in line with the OP's expected result):

```
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
```
This vector can be used to subset `df1`, picking the matching events as a vector of dates:

```r
df1[df1$event1 %within% lapply(df2$event2, function(x) interval(x - 10, x)), ]
```

```
[1] "1980-01-10" "1980-01-13" "1980-01-14"
```
or

```r
df1[df1$event1 %within% lapply(df2$event2, function(x) interval(x - 10, x)), , drop = FALSE]
```

which returns the subsetted data.frame:

```
      event1
1 1980-01-10
2 1980-01-13
3 1980-01-14
```
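As a small refactoring (not part of the original approach), the interval list can be built once and reused, instead of being recomputed for each call:

```r
# Build the list of 10-day intervals once and reuse it for subsetting
intervals <- lapply(df2$event2, function(x) interval(x - 10, x))
df1[df1$event1 %within% intervals, , drop = FALSE]
```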
## `%inrange%` from data.table

For the sake of completeness, the data.table package provides a similar operator, `%inrange%`:
```r
library(data.table)
setDT(df1)
setDT(df2)
df1[event1 %inrange% df2[, .(event2 - 10L, event2)]]
```

```
       event1
1: 1980-01-10
2: 1980-01-13
3: 1980-01-14
```

`setDT(df1)` and `setDT(df2)` coerce the data.frames to data.table objects.
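Note that `setDT()` converts in place. If the original data.frames should stay untouched, `as.data.table()` creates copies instead; a sketch:

```r
# as.data.table() copies, leaving df1 and df2 as plain data.frames
dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)
dt1[event1 %inrange% dt2[, .(event2 - 10L, event2)]]
```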
## Benchmarking

Now we can try to answer the OP's first question concerning "the best approach". The OP has not specified the criteria by which to judge an approach as "the best"; probably, the OP was mainly concerned about the effort of writing 60 intervals by hand. As this issue has been solved, let's compare the three different approaches posted so far in terms of execution speed:
- `%within%` and `interval()` from lubridate
- `expand.grid()` as suggested by marcguery
- `%inrange%` from data.table
For benchmarking, the bench package is used because it measures execution times as well as memory allocation for varying problem sizes, and it also checks that the results are identical. Therefore, the three approaches are modified to return a vector of dates.
```r
library(bench)
library(ggplot2)
bm <- press(
  n1 = c(100L, 1E3L, 1E4L),
  n2 = c(10L, 100L, 1000L),
  {
    beg <- as.Date("1980-01-01")
    end <- as.Date("2020-12-31")
    df1 <- data.frame(event1 = seq(beg, end, length.out = n1))
    df2 <- data.frame(event2 = seq(beg, end, length.out = n2))
    dt1 <- as.data.table(df1)
    dt2 <- as.data.table(df2)
    mark(
      within = {
        df1[df1$event1 %within% lapply(df2$event2, function(x) interval(x - 10, x)), ]
      },
      inrange = {
        dt1[event1 %inrange% dt2[, .(event2 - 10L, event2)], event1]
      },
      exp.grid = {
        combinations <- expand.grid(df1$event1, df2$event2)
        matches <- combinations[combinations[, 2] >= combinations[, 1] &
                                  combinations[, 2] - combinations[, 1] <= 10, ]
        unique(matches[[1L]])
      },
      check = TRUE
    )
  }
)
autoplot(bm)
```

Please note the logarithmic time scale.
Only for the smallest problem sizes is the `expand.grid()` approach the fastest. For all other problem sizes (including the case of 1000 `event1` and 100 `event2`, which is close to the OP's problem size), data.table's `%inrange%` is the fastest. For the largest case with 10000 `event1` and 1000 `event2`, data.table is more than two orders of magnitude faster than the other approaches.
```r
library(dplyr)
bm %>%
  select(1:11) %>%
  filter(n1 == max(n1), n2 == max(n2)) %>%
  mutate(expression = names(expression) %>% unique())
```

```
# A tibble: 3 x 11
  expression    n1    n2      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <chr>      <int> <int> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 within     10000  1000 780.16ms  780.2ms      1.28     307MB     2.56     1     2      780ms
2 inrange    10000  1000   2.68ms    3.3ms    293.       491KB     0      147     0      502ms
3 exp.grid   10000  1000 834.35ms  834.3ms      1.20     882MB     3.60     1     3      834ms
```
Furthermore, data.table allocates about three orders of magnitude less memory (0.5 MB vs 307 MB or 882 MB, respectively).
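For a quick overview across all problem sizes, the bench package can also report relative measures; a minimal sketch, where `summary()` with `relative = TRUE` rescales each metric against the best value:

```r
# Show timings and memory relative to the best performer
summary(bm, relative = TRUE)
```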