21

I am doing a left non-equi join using data.table:

OUTPUT <- DT2[DT1, on=.(DOB, FORENAME, SURNAME, POSTCODE, START_DATE <= MONTH, EXPIRY_DATE >= MONTH)]

The OUTPUT contains a correct left join, with the exception that the MONTH column (which is present in DT1) is missing.

Is this a bug in data.table?

NB: Of course, START_DATE, EXPIRY_DATE and MONTH are in the same YYYY-MM-DD, IDate format. The results of the join are correct based on these non-equi criteria. It is just that the column is missing and I need to use it in further work.

Edit 1: Simplified reproducible example

DT1 <- structure(list(ID = c(1, 2, 3), FORENAME = c("JOHN", "JACK", 
"ROB"), SURNAME = c("JOHNSON", "JACKSON", "ROBINSON"), MONTH = structure(c(16953L, 
16953L, 16953L), class = c("IDate", "Date"))), .Names = c("ID", 
"FORENAME", "SURNAME", "MONTH"), row.names = c(NA, -3L), class = c("data.table", 
"data.frame"))

DT2 <- structure(list(CERT_NUMBER = 999, FORENAME = "JOHN", SURNAME = "JOHNSON", 
    START_DATE = structure(16801L, class = c("IDate", "Date")), 
    EXPIRY_DATE = structure(17166L, class = c("IDate", "Date"
    ))), .Names = c("CERT_NUMBER", "FORENAME", "SURNAME", "START_DATE", 
"EXPIRY_DATE"), row.names = c(NA, -1L), class = c("data.table", 
"data.frame"))

OUTPUT <- DT2[DT1, on=.(FORENAME, SURNAME, START_DATE <= MONTH, EXPIRY_DATE >= MONTH)]

> OUTPUT
   CERT_NUMBER FORENAME  SURNAME START_DATE EXPIRY_DATE ID
1:         999     JOHN  JOHNSON 2016-06-01  2016-06-01  1
2:          NA     JACK  JACKSON 2016-06-01  2016-06-01  2
3:          NA      ROB ROBINSON 2016-06-01  2016-06-01  3
  • FORENAME and SURNAME are joined on and are present in the output.
  • MONTH is also (non-equi) joined on, and is absent from the output.

Why is this expected behaviour?

Even if it is expected behaviour, it is not helpful in my case, because I need to retain MONTH for further data manipulation.

My expected output would be the same table, but with the MONTH column retained as it is in DT1. After all, what I expect from a left join is for every row and column to be retained from the left table (DT1) and for all columns and only matched rows to be added from the right table (DT2).

   CERT_NUMBER FORENAME  SURNAME START_DATE EXPIRY_DATE ID      MONTH
1:         999     JOHN  JOHNSON 2016-01-01  2016-12-31  1 2016-06-01
2:          NA     JACK  JACKSON       <NA>        <NA>  2 2016-06-01
3:          NA      ROB ROBINSON       <NA>        <NA>  3 2016-06-01

Edit 2: apparently in the output produced by my code, the START and END dates are also wrong! Only person 1 had a certificate with a start date on 1-Jan and end date on 31-Dec! The expected output is what it should be. But the actual output made everything 1-Jan.

Mihail
  • 761
  • 5
  • 22
  • 1
    It is always good to give a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610). As the question currently stands, it is pretty hard for people to see what goes wrong. – Jaap May 31 '17 at 10:23
  • What do you mean with *with the exception that the MONTH column (which is present in DT1) is missing*? Is it missing in `OUTPUT`? If that is your question: that is normal behavior because you are joining on it. Hence, `OUTPUT` will have `START_DATE` and `EXPIRY_DATE` but not `MONTH` because it is compared with the other two. – Jaap May 31 '17 at 10:27
  • @Jaap: yes that is exactly what is happening. However, the `OUTPUT` still contains `DOB`, `FORENAME`, `SURNAME`, `POSTCODE` and these columns are also being joined on. Is there a tweak to keep the `MONTH` column? – Mihail May 31 '17 at 10:33
  • Not sure, but you could try `OUTPUT <- DT2[DT1, on=.(DOB, FORENAME, SURNAME, POSTCODE, START_DATE <= MONTH, EXPIRY_DATE >= MONTH), MONTH := i.MONTH]`. But as I said in my earlier comment: that is hard to test without a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610). – Jaap May 31 '17 at 10:38

2 Answers2

32

In data.table, joins of the form x[i] traditionally uses values from i but uses column names from x. Even though this is different from SQL which returns both, this default makes a lot of sense for equi joins since we are interested in all rows from i and if they match then both data.tables have equal values anyway, and if they don't we need to keep those unmatched values from i in result.

But for non-equi joins, since the values might not match exactly, i.e., can fall within a range, there might be cases where we will have to return outputs similar to SQL (or identify such cases and return the result user expects, similar to the case of equi joins). This hasn't been done yet, but I've placed a provision for it at the moment, which is to refer to the columns explicitly with a x. prefix. It is not convenient, I agree. Hopefully this'll be taken care of automatically soon.

Here's how to get your result using x. prefix.

ans <- DT2[DT1, .(CERT_NUMBER, FORENAME, SURNAME, x.START_DATE, x.EXPIRY_DATE, ID, MONTH), 
         on=.(FORENAME, SURNAME, START_DATE <= MONTH, EXPIRY_DATE >= MONTH)]

IIRC there's an issue filed on the GitHub project page about this as well.

Arun
  • 116,683
  • 26
  • 284
  • 387
  • Thanks - this worked for me and seems to be the solution to this issue indeed! – victor_v Oct 20 '17 at 06:56
  • 2
    Does the latest version of data.table fix this problem? Or how can I fix this problem with the latest version? – wasmetqall Apr 25 '18 at 03:26
  • @wasmetqall a more recent discussion is here which may be helpful https://stackoverflow.com/questions/68366828/non-equi-join-does-not-preserve-original-column-values – user63230 May 25 '22 at 10:44
1

I was trying to use @Arun nice solution but automate it so that you don't have to write out the names of all the columns from the two datasets.

myout <- DT2[DT1, 
             on = .(FORENAME, SURNAME, START_DATE <= MONTH, EXPIRY_DATE >= MONTH), 
             mget(c(paste0("x.", names(DT2)), paste0("i.", names(DT1))))]
setnames(myout, c(names(DT2), names(DT1)))
#extract second unique column names since we want results for every row of DT1
#https://stackoverflow.com/questions/72377103/remove-first-occurrence-of-duplicate-column-names-data-table/
myout[, .SD, .SDcols = ! duplicated(colnames(myout), fromLast = TRUE)]
#    CERT_NUMBER START_DATE EXPIRY_DATE ID FORENAME  SURNAME      MONTH
# 1:         999 2016-01-01  2016-12-31  1     JOHN  JOHNSON 2016-06-01
# 2:          NA       <NA>        <NA>  2     JACK  JACKSON 2016-06-01
# 3:          NA       <NA>        <NA>  3      ROB ROBINSON 2016-06-01

Some useful examples here, here, here, here, here, here.

user63230
  • 4,095
  • 21
  • 43