-1

I'm trying to conditionally join two data.tables df1 and df2, but i get a strange output that I cannot explain.

> df1 <- data.table(ID = 1, row_unique_identifier = 1:41, date_1 = as.Date(c(rep('2016-11-14',7), '2017-11-02', rep('2017-11-17',7), rep('2018-12-06', 20), rep('2018-12-07', 6))))
> df2=data.table(ID=1, Date_2A=as.Date(c('2016-11-14', '2018-12-06', '2017-11-17')), Date_2B = as.Date(c('2016-11-14', '2018-12-07', '2017-11-17')), Date_2B_EXTENDED = as.Date(c('2016-11-20', '2018-12-13', '2017-11-23')) )


> df2
   ID    Date_2A    Date_2B Date_2B_EXTENDED
1:  1 2016-11-14 2016-11-14       2016-11-20
2:  1 2018-12-06 2018-12-07       2018-12-13
3:  1 2017-11-17 2017-11-17       2017-11-23

I want to conditionally join df1 with df2 so that:

  1. all df1 rows should be returned (so it's a left join)
  2. df1.ID= df2.ID(df1 and df2 are samples, in reality there are many IDs)
  3. AND df1.date_1 is between df2.Date_2A and df2.Date_2B_EXTENDED
  4. I want my output to include all of df1 columns PLUS df2 columns with NAs for entries that couldn't be matched.

but when I do the join as follows, the output is strange. It seems that Date_1 overwrites the other three columns of df2 and some strange NA's as well that don't below there:

> df1[, c('Date_2A', 'Date_2B', 'Date_2B_EXTENDED') := df2[df1, on= .(ID
                                                                    , Date_2A <= date_1
                                                                    , Date_2B_EXTENDED >= date_1) 
                                                               ,.(Date_2A, Date_2B, Date_2B_EXTENDED)
                                                               , mult = "last"]]
> df1
    ID row_unique_identifier     date_1    Date_2A    Date_2B Date_2B_EXTENDED
 1:  1                     1 2016-11-14 2016-11-14 2016-11-14       2016-11-14
 2:  1                     2 2016-11-14 2016-11-14 2016-11-14       2016-11-14
 3:  1                     3 2016-11-14 2016-11-14 2016-11-14       2016-11-14
 4:  1                     4 2016-11-14 2016-11-14 2016-11-14       2016-11-14
 5:  1                     5 2016-11-14 2016-11-14 2016-11-14       2016-11-14
 6:  1                     6 2016-11-14 2016-11-14 2016-11-14       2016-11-14
 7:  1                     7 2016-11-14 2016-11-14 2016-11-14       2016-11-14
 8:  1                     8 2017-11-02 2017-11-02       <NA>       2017-11-02
 9:  1                     9 2017-11-17 2017-11-17 2017-11-17       2017-11-17
10:  1                    10 2017-11-17 2017-11-17 2017-11-17       2017-11-17
11:  1                    11 2017-11-17 2017-11-17 2017-11-17       2017-11-17
12:  1                    12 2017-11-17 2017-11-17 2017-11-17       2017-11-17
13:  1                    13 2017-11-17 2017-11-17 2017-11-17       2017-11-17
14:  1                    14 2017-11-17 2017-11-17 2017-11-17       2017-11-17
15:  1                    15 2017-11-17 2017-11-17 2017-11-17       2017-11-17
16:  1                    16 2018-12-06 2018-12-06 2018-12-07       2018-12-06
17:  1                    17 2018-12-06 2018-12-06 2018-12-07       2018-12-06
18:  1                    18 2018-12-06 2018-12-06 2018-12-07       2018-12-06
19:  1                    19 2018-12-06 2018-12-06 2018-12-07       2018-12-06
20:  1                    20 2018-12-06 2018-12-06 2018-12-07       2018-12-06
21:  1                    21 2018-12-06 2018-12-06 2018-12-07       2018-12-06
22:  1                    22 2018-12-06 2018-12-06 2018-12-07       2018-12-06
23:  1                    23 2018-12-06 2018-12-06 2018-12-07       2018-12-06
24:  1                    24 2018-12-06 2018-12-06 2018-12-07       2018-12-06
25:  1                    25 2018-12-06 2018-12-06 2018-12-07       2018-12-06
26:  1                    26 2018-12-06 2018-12-06 2018-12-07       2018-12-06
27:  1                    27 2018-12-06 2018-12-06 2018-12-07       2018-12-06
28:  1                    28 2018-12-06 2018-12-06 2018-12-07       2018-12-06
29:  1                    29 2018-12-06 2018-12-06 2018-12-07       2018-12-06
30:  1                    30 2018-12-06 2018-12-06 2018-12-07       2018-12-06
31:  1                    31 2018-12-06 2018-12-06 2018-12-07       2018-12-06
32:  1                    32 2018-12-06 2018-12-06 2018-12-07       2018-12-06
33:  1                    33 2018-12-06 2018-12-06 2018-12-07       2018-12-06
34:  1                    34 2018-12-06 2018-12-06 2018-12-07       2018-12-06
35:  1                    35 2018-12-06 2018-12-06 2018-12-07       2018-12-06
36:  1                    36 2018-12-07 2018-12-07 2018-12-07       2018-12-07
37:  1                    37 2018-12-07 2018-12-07 2018-12-07       2018-12-07
38:  1                    38 2018-12-07 2018-12-07 2018-12-07       2018-12-07
39:  1                    39 2018-12-07 2018-12-07 2018-12-07       2018-12-07
40:  1                    40 2018-12-07 2018-12-07 2018-12-07       2018-12-07
41:  1                    41 2018-12-07 2018-12-07 2018-12-07       2018-12-07

The way I understand the output is that somehow the comparison is successful but date_1 overwrites the two columns that it's being compared to: Date_2A and Date_2B_EXTENDED. The NA entry in Date_2B makes sense --> there was no matching.... but date_1 still overwrote date_2A and date_2B_EXTENDED on that row. How do I prevent the overwriting?? I've never had this happened to me before. I'm clearly specifying to add new column names to df1 after matching, so what's going on?


EDITED: I found a way around this issue by adding two extra columns: Date_2A_dummy and Date_2B_EXTENDED_dummy to df2 which are the same as Date_2A and Date_2B_EXTENDED respectively. Then changed my comparison statements to these two dummy variables and it works. But I'm still wondering if this is the best way?! To me it seems this issue that the two variables are getting over-written is a bug? I can't think of any use case for them to be overwritten.

Ankhnesmerira
  • 1,386
  • 15
  • 29
  • so, I found a way around this issue by adding two extra columns of Date_2A_dummy and Date_2B_EXTENDED_dummy to df2 which are the same as Date_2A and Date_2B_EXTENDED respectively. Then changed my comparison to these two dummy variables and it's OK. But still, I'm wondering if this is the best way?! to me it seems it's a bug that the two variables are getting over-written – Ankhnesmerira May 29 '20 at 08:09

1 Answers1

1

I think this is what you need...

df1[ df2, 
     `:=`( Date_2A = i.Date_2A, Date_2B = i.Date_2B, Date_2B_EXTENDED = i.Date_2B_EXTENDED ), 
     on = .( ID, date_1 >= Date_2A, date_1 <= Date_2B_EXTENDED ) ][]

output

#     ID row_unique_identifier     date_1    Date_2A    Date_2B Date_2B_EXTENDED
#  1:  1                     1 2016-11-14 2016-11-14 2016-11-14       2016-11-20
#  2:  1                     2 2016-11-14 2016-11-14 2016-11-14       2016-11-20
#  3:  1                     3 2016-11-14 2016-11-14 2016-11-14       2016-11-20
#  4:  1                     4 2016-11-14 2016-11-14 2016-11-14       2016-11-20
#  5:  1                     5 2016-11-14 2016-11-14 2016-11-14       2016-11-20
#  6:  1                     6 2016-11-14 2016-11-14 2016-11-14       2016-11-20
#  7:  1                     7 2016-11-14 2016-11-14 2016-11-14       2016-11-20
#  8:  1                     8 2017-11-02       <NA>       <NA>             <NA>
#  9:  1                     9 2017-11-17 2017-11-17 2017-11-17       2017-11-23
# 10:  1                    10 2017-11-17 2017-11-17 2017-11-17       2017-11-23
# 11:  1                    11 2017-11-17 2017-11-17 2017-11-17       2017-11-23
# 12:  1                    12 2017-11-17 2017-11-17 2017-11-17       2017-11-23
# 13:  1                    13 2017-11-17 2017-11-17 2017-11-17       2017-11-23
# 14:  1                    14 2017-11-17 2017-11-17 2017-11-17       2017-11-23
# 15:  1                    15 2017-11-17 2017-11-17 2017-11-17       2017-11-23
# 16:  1                    16 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 17:  1                    17 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 18:  1                    18 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 19:  1                    19 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 20:  1                    20 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 21:  1                    21 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 22:  1                    22 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 23:  1                    23 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 24:  1                    24 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 25:  1                    25 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 26:  1                    26 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 27:  1                    27 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 28:  1                    28 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 29:  1                    29 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 30:  1                    30 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 31:  1                    31 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 32:  1                    32 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 33:  1                    33 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 34:  1                    34 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 35:  1                    35 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 36:  1                    36 2018-12-07 2018-12-06 2018-12-07       2018-12-13
# 37:  1                    37 2018-12-07 2018-12-06 2018-12-07       2018-12-13
# 38:  1                    38 2018-12-07 2018-12-06 2018-12-07       2018-12-13
# 39:  1                    39 2018-12-07 2018-12-06 2018-12-07       2018-12-13
# 40:  1                    40 2018-12-07 2018-12-06 2018-12-07       2018-12-13
# 41:  1                    41 2018-12-07 2018-12-06 2018-12-07       2018-12-13
#     ID row_unique_identifier     date_1    Date_2A    Date_2B Date_2B_EXTENDED
Wimpel
  • 26,031
  • 1
  • 20
  • 37
  • Thanks so much. Is it possible to explain how this code works? i want df1 to be outer joined, while it seems your solution is outer joining df2. I've never worked with `:=` before (in '''') and what does the [] do at the end? <:} Many thanks – Ankhnesmerira May 29 '20 at 08:58
  • 1
    the `[]` at the end outputs the result... it can be left out. The join is an update join of df1, with data from df2. so df1 stays untouched, and the first match from the `on`-statement is retured.. even when there are multiple matches, only the first one is returned. – Wimpel May 29 '20 at 09:49