4

I am trying to do an inner join using data table that has multiple, fairly dynamic conditions. I am getting tripped up on the syntax. First, I create two objects, x and x2 that I want to do an inner join with.

set.seed(1)
#generate data
x = data.table(CJ(t=1:10, d=1:3,p1s=seq(1,3,by=0.1),p1sLAST=seq(1,3,by=0.1)))
x[d==1,p1sLAST:=3]
x=x[p1s<=p1sLAST]
x2 = data.table(CJ(tprime=1:10, p1sLASTprm=seq(1,3,by=0.1)))

With the objects:

> x
    t d p1s p1sLAST
1:  1 1 1.0     3.0
2:  1 1 1.0     3.0
3:  1 1 1.0     3.0
4:  1 1 1.0     3.0
5:  1 1 1.0     3.0
---                 
9026: 10 3 2.8     2.9
9027: 10 3 2.8     3.0
9028: 10 3 2.9     2.9
9029: 10 3 2.9     3.0
9030: 10 3 3.0     3.0


> x2
    tprime p1sLASTprm
1:      1        1.0
2:      1        1.1
3:      1        1.2
4:      1        1.3
5:      1        1.4
---                  
206:     10        2.6
207:     10        2.7
208:     10        2.8
209:     10        2.9
210:     10        3.0

Now, I want to do these last three steps in a single inner join.

joined = x[,x2[],by=names(x)]
joined=joined[p1sLASTprm==p1s & d!=3 | d==3 & p1sLASTprm==3]
joined=joined[tprime==t+1]

Resulting in the final output:

> joined
       t  d  p1s   p1sLAST  tprime    p1sLASTprm
    1: 1 1    1.0     3.0      2        1.0
    2: 1 1    1.1     3.0      2        1.1
    3: 1 1    1.2     3.0      2        1.2
    4: 1 1    1.3     3.0      2        1.3
    5: 1 1    1.4     3.0      2        1.4
    ---                                  
    4343: 9 3 2.8     2.9     10        3.0
    4344: 9 3 2.8     3.0     10        3.0
    4345: 9 3 2.9     2.9     10        3.0
    4346: 9 3 2.9     3.0     10        3.0
    4347: 9 3 3.0     3.0     10        3.0
wolfsatthedoor
  • 7,163
  • 18
  • 46
  • 90

1 Answers1

5

I do not think a single inner join can accomplish those 3 steps since there is a | and most likely a union of results will be required.

A more memory efficient approach could be:

ux <- unique(x)[, upt := t+1]
rbindlist(list(
    ux[d!=3][x2,
        c(mget(names(ux)), mget(names(x2))),
        on=c("p1s"="p1sLASTprm", "upt"="tprime"),
        nomatch=0L],
    ux[d==3][x2[p1sLASTprm==3],
        c(mget(names(ux)), mget(names(x2))),
        on=c("upt"="tprime"),
        nomatch=0L]
))
chinsoon12
  • 25,005
  • 4
  • 25
  • 35
  • If I have other columns in the data table x2, how do I ensure they are included? It's too much to type out them all, but could I do something like? `q[d!=3][q2, c(.SD, .(p1sLASTprm=p1s,c2=c2,c3=c3)), on=c("p1s"="p1sLASTprm", "upt"="tprm"), nomatch=0L,allow.cartesian=TRUE]` but instead.... `othercols = c('c1','c2')` and then `q[d!=3][q2, c(.SD, .(p1sLASTprm=p1s,othercols)), on=c("p1s"="p1sLASTprm", "upt"="tprm"), nomatch=0L,allow.cartesian=TRUE]` – wolfsatthedoor Mar 13 '19 at 18:42
  • I added something more general but things might get a little wonky when `ux` and `x2` have the same column names (i.e. the output might contain columns with the same names). But that can be resolved as well :). see this link for the general approach: – chinsoon12 Mar 14 '19 at 01:11
  • https://stackoverflow.com/questions/24833247/how-can-one-work-fully-generically-in-data-table-in-r-with-column-names-in-varia – chinsoon12 Mar 14 '19 at 01:22