I'm trying to conditionally join two data.tables df1 and df2, but i get a strange output that I cannot explain.
> df1 <- data.table(ID = 1, row_unique_identifier = 1:41, date_1 = as.Date(c(rep('2016-11-14',7), '2017-11-02', rep('2017-11-17',7), rep('2018-12-06', 20), rep('2018-12-07', 6))))
> df2=data.table(ID=1, Date_2A=as.Date(c('2016-11-14', '2018-12-06', '2017-11-17')), Date_2B = as.Date(c('2016-11-14', '2018-12-07', '2017-11-17')), Date_2B_EXTENDED = as.Date(c('2016-11-20', '2018-12-13', '2017-11-23')) )
> df2
ID Date_2A Date_2B Date_2B_EXTENDED
1: 1 2016-11-14 2016-11-14 2016-11-20
2: 1 2018-12-06 2018-12-07 2018-12-13
3: 1 2017-11-17 2017-11-17 2017-11-23
I want to conditionally join df1 with df2 so that:
- all df1 rows should be returned (so it's a left join)
- df1.ID= df2.ID(df1 and df2 are samples, in reality there are many IDs)
- AND df1.date_1 is between df2.Date_2A and df2.Date_2B_EXTENDED
- I want my output to include all of df1 columns PLUS df2 columns with NAs for entries that couldn't be matched.
but when I do the join as follows, the output is strange. It seems that Date_1 overwrites the other three columns of df2 and some strange NA's as well that don't below there:
> df1[, c('Date_2A', 'Date_2B', 'Date_2B_EXTENDED') := df2[df1, on= .(ID
, Date_2A <= date_1
, Date_2B_EXTENDED >= date_1)
,.(Date_2A, Date_2B, Date_2B_EXTENDED)
, mult = "last"]]
> df1
ID row_unique_identifier date_1 Date_2A Date_2B Date_2B_EXTENDED
1: 1 1 2016-11-14 2016-11-14 2016-11-14 2016-11-14
2: 1 2 2016-11-14 2016-11-14 2016-11-14 2016-11-14
3: 1 3 2016-11-14 2016-11-14 2016-11-14 2016-11-14
4: 1 4 2016-11-14 2016-11-14 2016-11-14 2016-11-14
5: 1 5 2016-11-14 2016-11-14 2016-11-14 2016-11-14
6: 1 6 2016-11-14 2016-11-14 2016-11-14 2016-11-14
7: 1 7 2016-11-14 2016-11-14 2016-11-14 2016-11-14
8: 1 8 2017-11-02 2017-11-02 <NA> 2017-11-02
9: 1 9 2017-11-17 2017-11-17 2017-11-17 2017-11-17
10: 1 10 2017-11-17 2017-11-17 2017-11-17 2017-11-17
11: 1 11 2017-11-17 2017-11-17 2017-11-17 2017-11-17
12: 1 12 2017-11-17 2017-11-17 2017-11-17 2017-11-17
13: 1 13 2017-11-17 2017-11-17 2017-11-17 2017-11-17
14: 1 14 2017-11-17 2017-11-17 2017-11-17 2017-11-17
15: 1 15 2017-11-17 2017-11-17 2017-11-17 2017-11-17
16: 1 16 2018-12-06 2018-12-06 2018-12-07 2018-12-06
17: 1 17 2018-12-06 2018-12-06 2018-12-07 2018-12-06
18: 1 18 2018-12-06 2018-12-06 2018-12-07 2018-12-06
19: 1 19 2018-12-06 2018-12-06 2018-12-07 2018-12-06
20: 1 20 2018-12-06 2018-12-06 2018-12-07 2018-12-06
21: 1 21 2018-12-06 2018-12-06 2018-12-07 2018-12-06
22: 1 22 2018-12-06 2018-12-06 2018-12-07 2018-12-06
23: 1 23 2018-12-06 2018-12-06 2018-12-07 2018-12-06
24: 1 24 2018-12-06 2018-12-06 2018-12-07 2018-12-06
25: 1 25 2018-12-06 2018-12-06 2018-12-07 2018-12-06
26: 1 26 2018-12-06 2018-12-06 2018-12-07 2018-12-06
27: 1 27 2018-12-06 2018-12-06 2018-12-07 2018-12-06
28: 1 28 2018-12-06 2018-12-06 2018-12-07 2018-12-06
29: 1 29 2018-12-06 2018-12-06 2018-12-07 2018-12-06
30: 1 30 2018-12-06 2018-12-06 2018-12-07 2018-12-06
31: 1 31 2018-12-06 2018-12-06 2018-12-07 2018-12-06
32: 1 32 2018-12-06 2018-12-06 2018-12-07 2018-12-06
33: 1 33 2018-12-06 2018-12-06 2018-12-07 2018-12-06
34: 1 34 2018-12-06 2018-12-06 2018-12-07 2018-12-06
35: 1 35 2018-12-06 2018-12-06 2018-12-07 2018-12-06
36: 1 36 2018-12-07 2018-12-07 2018-12-07 2018-12-07
37: 1 37 2018-12-07 2018-12-07 2018-12-07 2018-12-07
38: 1 38 2018-12-07 2018-12-07 2018-12-07 2018-12-07
39: 1 39 2018-12-07 2018-12-07 2018-12-07 2018-12-07
40: 1 40 2018-12-07 2018-12-07 2018-12-07 2018-12-07
41: 1 41 2018-12-07 2018-12-07 2018-12-07 2018-12-07
The way I understand the output is that somehow the comparison is successful but date_1 overwrites the two columns that it's being compared to: Date_2A and Date_2B_EXTENDED. The NA entry in Date_2B makes sense --> there was no matching.... but date_1 still overwrote date_2A and date_2B_EXTENDED on that row. How do I prevent the overwriting?? I've never had this happened to me before. I'm clearly specifying to add new column names to df1 after matching, so what's going on?
EDITED: I found a way around this issue by adding two extra columns: Date_2A_dummy and Date_2B_EXTENDED_dummy to df2 which are the same as Date_2A and Date_2B_EXTENDED respectively. Then changed my comparison statements to these two dummy variables and it works. But I'm still wondering if this is the best way?! To me it seems this issue that the two variables are getting over-written is a bug? I can't think of any use case for them to be overwritten.