1

This question gives an example on how to convert the by.x = and by.y = arguments in base R merge to data.table syntax, to specify differently named columns as the join key:

data.table merge by multiple columns

However, I can't work out how to do the same for a non-equi join, and I am very confused by the output.

Example data:

set.seed(0)
tmp_dt1<- data.table(grp = c(1,2), time = runif(100))
tmp_dt2 <- data.table(grp = c(1,2), time = c(0.1, 0.5))
tmp_dt2 <- tmp_dt2[, time_to := time + 0.2]
tmp_dt2 <- tmp_dt2[, time_from := time] # for clarity, rename time variable

I would like to equi-join the two tables by grp, followed by a non-equi join so that I only keep time from tmp_dt1 where it falls between time_to and time_from. From what I can tell tmp_dt1[tmp_dt2, , on = c("grp", "time>=time", "time<=time_to")] does what I want:

> tmp_dt1[tmp_dt2, , on = c("grp", "time>=time", "time<=time_to")]
    grp time time.1 time_from
 1:   1  0.1    0.3       0.1
 2:   1  0.1    0.3       0.1
 3:   1  0.1    0.3       0.1
 4:   1  0.1    0.3       0.1
 5:   1  0.1    0.3       0.1
 6:   1  0.1    0.3       0.1
 7:   1  0.1    0.3       0.1
...

What confuses me is that x.time is missing, and the resulting column names are very confusing. For example, why is there a column called time.1? I would like to clarify the syntax so that tmp_dt1[tmp_dt2, , on = c("grp", "time>=y.time", "time<=y.time_to")] produces:

    grp y.time y.time_to time_from
 1:   1  0.1    0.3       0.1
 2:   1  0.1    0.3       0.1
 3:   1  0.1    0.3       0.1
 4:   1  0.1    0.3       0.1
 5:   1  0.1    0.3       0.1
 6:   1  0.1    0.3       0.1
 7:   1  0.1    0.3       0.1
...

and some how also extract the column x.time, in addition to all columns in y. Unfortunately this fails with the error:

> tmp_dt1[tmp_dt2, , on = c("grp", "time>=y.time", "time<=y.time_to")]
Error in `[.data.table`(tmp_dt1, tmp_dt2, , on = c("grp", "time>=y.time",  : 
  Column(s) [y.time,y.time_to] not found in i

Trying the following also does not produce what I expect, instead I get:

> tmp_dt1[tmp_dt2, .(grp, time, time_from = i.time, time_to = i.time_to), on = c("grp", "time>=time", "time<=time_to")]
    grp time time_from time_to
 1:   1  0.1       0.1     0.3
 2:   1  0.1       0.1     0.3
 3:   1  0.1       0.1     0.3
 4:   1  0.1       0.1     0.3
 5:   1  0.1       0.1     0.3
 6:   1  0.1       0.1     0.3
 7:   1  0.1       0.1     0.3

where the time column bears no resemblance to tmp_dt1$time.

Alex
  • 15,186
  • 15
  • 73
  • 127
  • ok, as far as I can tell, the following rules for join key matching apply if the `on` condition is supplied using `.()` or `c()` in `X[Y]`. If each element contains one variable, then look for that variable in both tables. If variables are supplied on either side of an equation, then look for LHS variable in `X` table and RHS variable in `Y` table. The variable extraction and final column names are still a mystery to me. – Alex Jun 20 '17 at 01:37
  • 1
    Final column names are always from Y. There is one join column for each entry in `on=`. – Frank Jun 20 '17 at 02:18
  • You might want to clarify the expected result. Here's one fairly reasonable thing to go for: `tmp_dt1[tmp_dt1[tmp_dt2, on = .(grp, time >= time_from, time <= time_to), which=TRUE]]` subsetting the first table. I doubt there's much point in hanging onto the lower and upper bounds as repeating values. – Frank Jun 20 '17 at 02:49
  • thanks @Frank, I changed a few sentences. What I meant was that I wanted the original `time` column from the first data table. – Alex Jun 20 '17 at 02:52
  • @Frank, what do you mean by "final column names are always from Y". For example, `tmp_dt1[tmp_dt2, on = .(grp, time >= time_from, time <= time_to)]` does not produce a `time_from` and `time_to` column. – Alex Jun 20 '17 at 03:34
  • It looks like something like manually setting the column names: `tmp_dt1[tmp_dt2[, time := NULL], .(time = x.time, time_from, time_to), on = .(grp, time >= time_from, time <= time_to)]` does what I want, but I am still very confused on why this works, and why it is necessary to use `x.time` even when `time` is not a column in `Y`. – Alex Jun 20 '17 at 03:39
  • 1
    Yeah, sorry, I meant to say final column values. A join is `x[i]` (using the actual argument names), and it can be thought of as using rows of `i` to look up rows of `x`. For this reason, we end up with values from `i` in the result. At least that's how I think of it. To me, the column names are a sideshow and fairly easily customized... – Frank Jun 20 '17 at 03:45
  • 1
    I understand about confusing names (something that needs fixing). In the meanwhile, see if [this post](https://stackoverflow.com/a/44343424/559784) helps.. – Arun Jun 20 '17 at 21:50

2 Answers2

1

I think you want something like this, using dplyr:

library(dplyr)
merged <- inner_join(tmp_dt1, tmp_dt2, by="grp") %>%
          rowwise() %>% 
          filter(between(time.x, time_from, time_to)) %>%
          ungroup()

inner_join equi-joins by grp1. rowwise() specifies that I want the following statements to occur by row. filter() will filter rows by a condition. The condition uses between which asks is time.x >= time_from and time.x <= time_to. Finally, ungroup by row, in case you want a normal data.frame.

CPak
  • 13,260
  • 3
  • 30
  • 48
  • Thank you for this, unfortunately I am pretty sure that this non-equi join using dplyr requires far more memory than the data.table version. – Alex Jun 20 '17 at 02:30
  • Why not use data.table join and then pipe to dplyr filter? – CPak Jun 20 '17 at 02:34
  • the data.table join (non-equi part) already does the filtering. In any case, my problem is controlling the input and outputs of the join. – Alex Jun 20 '17 at 02:36
1

To prevent confusion i would suggest to rename columns that have the same name in both data.tables and create copies of the columns of non-equi join columns.

setnames(tmp_dt2, "time", "time_dt2") tmp_dt2[, c("time_from_join", "time_to_join"):=list(time_from, time_to)] tmp_dt1[ , time_join := time]

Then we can join and then throw away all the temporary columns that data.table messes with in the non-equi join.

tmp_dt1[tmp_dt2, on=.(grp==grp, time_join >= time_from_join, time_join <= time_to_join)][ , c("grp", "time", "time_from", "time_to", "time_dt2")]

grp time time_from time_to time_dt2 1: 1 0.1079436 0.1 0.3 0.1 2: 1 0.1216919 0.1 0.3 0.1 3: 1 0.1255551 0.1 0.3 0.1 4: 1 0.1433044 0.1 0.3 0.1 ...