This question gives an example on how to convert the by.x =
and by.y =
arguments in base R
merge
to data.table
syntax, to specify differently named columns as the join key:
data.table merge by multiple columns
However, I can't work out how to do the same for a non-equi join, and I am very confused by the output.
Example data:
set.seed(0)
tmp_dt1<- data.table(grp = c(1,2), time = runif(100))
tmp_dt2 <- data.table(grp = c(1,2), time = c(0.1, 0.5))
tmp_dt2 <- tmp_dt2[, time_to := time + 0.2]
tmp_dt2 <- tmp_dt2[, time_from := time] # for clarity, rename time variable
I would like to equi-join the two tables by grp
, followed by a non-equi join so that I only keep time
from tmp_dt1
where it falls between time_to
and time_from
. From what I can tell tmp_dt1[tmp_dt2, , on = c("grp", "time>=time", "time<=time_to")]
does what I want:
> tmp_dt1[tmp_dt2, , on = c("grp", "time>=time", "time<=time_to")]
grp time time.1 time_from
1: 1 0.1 0.3 0.1
2: 1 0.1 0.3 0.1
3: 1 0.1 0.3 0.1
4: 1 0.1 0.3 0.1
5: 1 0.1 0.3 0.1
6: 1 0.1 0.3 0.1
7: 1 0.1 0.3 0.1
...
What confuses me is that x.time
is missing, and the resulting column names are very confusing. For example, why is there a column called time.1
? I would like to clarify the syntax so that tmp_dt1[tmp_dt2, , on = c("grp", "time>=y.time", "time<=y.time_to")]
produces:
grp y.time y.time_to time_from
1: 1 0.1 0.3 0.1
2: 1 0.1 0.3 0.1
3: 1 0.1 0.3 0.1
4: 1 0.1 0.3 0.1
5: 1 0.1 0.3 0.1
6: 1 0.1 0.3 0.1
7: 1 0.1 0.3 0.1
...
and some how also extract the column x.time
, in addition to all columns in y
. Unfortunately this fails with the error:
> tmp_dt1[tmp_dt2, , on = c("grp", "time>=y.time", "time<=y.time_to")]
Error in `[.data.table`(tmp_dt1, tmp_dt2, , on = c("grp", "time>=y.time", :
Column(s) [y.time,y.time_to] not found in i
Trying the following also does not produce what I expect, instead I get:
> tmp_dt1[tmp_dt2, .(grp, time, time_from = i.time, time_to = i.time_to), on = c("grp", "time>=time", "time<=time_to")]
grp time time_from time_to
1: 1 0.1 0.1 0.3
2: 1 0.1 0.1 0.3
3: 1 0.1 0.1 0.3
4: 1 0.1 0.1 0.3
5: 1 0.1 0.1 0.3
6: 1 0.1 0.1 0.3
7: 1 0.1 0.1 0.3
where the time
column bears no resemblance to tmp_dt1$time
.