The reason that t_id = 1
doesn't show up in the output is because a rolling join takes the row where the key-combination occurs last. From the documentation (emphasis mine):
Applies to the last join column, generally a date but can be any
ordered variable, irregular and including gaps. If roll=TRUE and i's
row matches to all but the last x join column, and its value in the
last i join column falls in a gap (including after the last
observation in x for that group), then the prevailing value in x is
rolled forward. This operation is particularly fast using a modified
binary search. The operation is also known as last observation carried
forward (LOCF).
Let's consider somewhat larger datasets:
> DT
t_id airport thisTime
1: 1 a 5.1
2: 4 a 5.1
3: 3 a 5.1
4: 2 d 6.2
5: 5 d 6.2
> DT_LU
f_id airport thisTime
1: 1 a 6
2: 2 a 6
3: 2 a 8
4: 1 b 7
5: 1 c 8
6: 2 d 7
7: 1 d 9
When you perform a rolling join just like in your question:
DT[DT_LU, nomatch=0, roll=Inf]
you get:
t_id airport thisTime f_id
1: 3 a 6 1
2: 3 a 6 2
3: 3 a 8 2
4: 5 d 7 2
5: 5 d 9 1
As you can see, from both the key combination a, 5.1
and d, 6.2
the last row is used for the joined datatable. Because you use Inf
as roll-value, all the future values are incorporated in the resulting datatable. When you use:
DT[DT_LU, nomatch=0, roll=1]
you see that only the first value in the future is included:
t_id airport thisTime f_id
1: 3 a 6 1
2: 3 a 6 2
3: 5 d 7 2
If you want the f_id
's for for all combinations of airport
& thisTime
where DT$thisTime
is lower than DT_LU$thisTime
, you can achieve that by creating a new variable (or replacing the existing thisTime
) by means of the ceiling
function. An example where I create a new variable thisTime2
and then do a normal join with DT_LU
:
DT[, thisTime2 := ceiling(thisTime)]
setkey(DT, airport, thisTime2)[DT_LU, nomatch=0]
which gives:
t_id airport thisTime thisTime2 f_id
1: 1 a 5.1 6 1
2: 4 a 5.1 6 1
3: 3 a 5.1 6 1
4: 1 a 5.1 6 2
5: 4 a 5.1 6 2
6: 3 a 5.1 6 2
7: 2 d 6.2 7 2
8: 5 d 6.2 7 2
Applied to the data you provided:
> dt[, thisTime2 := ceiling(thisTime)]
> setkey(dt, airport, thisTime2)[dt_lookup, nomatch=0]
t_id airport thisTime thisTime2 f_id
1: 1 a 5.1 6 1
2: 3 a 5.1 6 1
3: 1 a 5.1 6 2
4: 3 a 5.1 6 2
When you want to include al the future values instead of only the first one, you need a somewhat different approach for which you will need the i.col
functionality (which is not documented yet):
1: First set the key to only the airport
columns:
setkey(DT, airport)
setkey(DT_LU, airport)
2: Use the i.col
functionality (which is not documented yet) in j
to get what you want as follows:
DT1 <- DT_LU[DT, .(tid = i.t_id,
tTime = i.thisTime,
fTime = thisTime[i.thisTime < thisTime],
fid = f_id[i.thisTime < thisTime]),
by=.EACHI]
this gives you:
> DT1
airport tid tTime fTime fid
1: a 1 5.1 6 1
2: a 1 5.1 6 2
3: a 1 5.1 8 2
4: a 4 5.1 6 1
5: a 4 5.1 6 2
6: a 4 5.1 8 2
7: a 3 5.1 6 1
8: a 3 5.1 6 2
9: a 3 5.1 8 2
10: d 2 6.2 7 2
11: d 2 6.2 9 1
12: d 5 6.2 7 2
13: d 5 6.2 9 1
Some explanation: In case when you are joining two datatables where the same columnnames are used, you can refer to the columns of the datatable in i
by preceding the columnnames with i.
. Now it's possible to compare thisTime
from DT
with thisTime
from DT_LU
. With by = .EACHI
you assure that all combinations for with the condition holds are included in the resulting datatable.
Alternatively, you can achieve the same with:
DT2 <- DT_LU[DT, .(airport=i.airport,
tid=i.t_id,
tTime=i.thisTime,
fTime=thisTime[i.thisTime < thisTime],
fid=f_id[i.thisTime < thisTime]),
allow.cartesian=TRUE]
which gives the same result:
> identical(DT1, DT2)
[1] TRUE
When you only want to include future values within a certain boundary, you can use:
DT1 <- DT_LU[DT,
{
idx = i.thisTime < thisTime & thisTime - i.thisTime < 2
.(tid = i.t_id,
tTime = i.thisTime,
fTime = thisTime[idx],
fid = f_id[idx])
},
by=.EACHI]
which gives:
> DT1
airport tid tTime fTime fid
1: a 1 5.1 6 1
2: a 1 5.1 6 2
3: a 4 5.1 6 1
4: a 4 5.1 6 2
5: a 3 5.1 6 1
6: a 3 5.1 6 2
7: d 2 6.2 7 2
8: d 5 6.2 7 2
When you compare that to the previous result, you see that now the rows 3, 6, 9, 10 and 12 have been removed.
Data:
DT <- data.table(t_id = c(1,4,2,3,5),
airport = c("a","a","d","a","d"),
thisTime = c(5.1, 5.1, 6.2, 5.1, 6.2),
key=c("airport","thisTime"))
DT_LU <- data.table(f_id = c(rep(1,4),rep(2,3)),
airport = c("a","b","c","d","a","d","e"),
thisTime = c(6,7,8,9,6,7,8),
key=c("airport","thisTime"))