3

This is a follow-up on this question, where the accepted answer showed an example of a matching exercise using data.table, including non-equi conditions.

Background

The basic set up is that we have DT1 with a sample of people's details, and DT2, which is a sort-of master database. And the aim is to find out whether each person in DT1 matches at least one entry in DT2.

First, we initialize a column that would indicate a match to FALSE, so that its values could be updated to TRUE whenever a match is found.

DT1[, MATCHED := FALSE]

The following general solution is then used to update the column:

DT1[, MATCHED := DT2[.SD, on=.(Criteria), .N, by=.EACHI ]$N > 0L ]

In theory, it looks (and should work) fine. The sub-expression DT2[.SD, on=.(Criteria), .N, by=.EACHI] produces a sub-table with each row from DT1, and computes the N column which is the number of matches for that row found in DT2. Then, whenever N is greater than zero, the value of MATCHED in DT1 is updated to TRUE.

It works as intended in a trivial reproducible example. But I encountered some unexpected behaviour using it with the real data, and cannot get to the bottom of it. I may be missing something or it may be a bug. Unfortunately, I cannot provide a minimal reproducible example, because the data is big, and it only shows in the big data. But I will try to document it as best I can.

Unexpected behaviour or a bug

What helped noticing this is that, for a historic reason, the matches needed to be sought in two separate databases, and hence, the filter !(MATCHED) was added to the expression to update only those values which have not already been matched:

DT1[!(MATCHED), MATCHED := DT2[.SD, on=.(Criteria), .N, by=.EACHI ]$N > 0L ]

I then noticed that if the line is re-ran several times, with each subsequent run, there will be more and more matches, which were not matched in the preceding runs. (Nothing to do with separate databases, each run matches to DT2).

First run:

   MATCHED       N
1:   FALSE 3248007
2:    TRUE 2379514

Second run:

   MATCHED       N
1:   FALSE 2149648
2:    TRUE 3477873

To investigate, I then filtered cases which weren't matched on the first run, but were matched on the second. It looks like most cases were false negatives, i.e. those which should have been matched on the first run, but weren't. (But with many runs, eventually there appear also many false positives).

For example, here is one entry from DT1:

         DATE FORENAME SURNAME
1: 2016-01-01     JOHN   SMITH

And a matching entry from DT2:

   START_DATE EXPIRY_DATE FORENAME SURNAME
1: 2015-09-09  2017-05-01     JOHN   SMITH

Running the sub-expression (described above) alone, outside the main expression, to look at the N numbers, we see that it does not result in a match, when it should (N=0). (You may also note that START_DATE and END_DATE take on the value of DATE in the output, but that is a whole other issue).

SUB <- DF2[DF1, on=.(FORENAME, SURNAME, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI]
SUB[FORENAME=="JOHN" & "SURNAME=="SMITH"]

   FORENAME SURNAME START_DATE EXPIRY_DATE N
1:     JOHN   SMITH 2016-01-01  2016-01-01 0

However, the buggy behaviour is that the result is affected by what other rows are present in DF1. For example, suppose I know that JOHN SMITH's row number in DF1 is 149 and filter DF1 to only that row:

DF2[DF1[149], on=.(FORENAME, SURNAME, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI]

   FORENAME SURNAME START_DATE EXPIRY_DATE N
1:     JOHN   SMITH 2016-01-01  2016-01-01 1

Secondly, I also noticed that the buggy behaviour occurs only with more than one non-equi criterion in the conditions. If the conditions are on=.(FORENAME, SURNAME, START_DATE <= DATE), there are no longer any differences between the runs and all rows appear to be matched correctly the first time.

Unfortunately, to solve the real-world problem, I must have several non-equi matching conditions. Not only to ensure that DT1's DATE is between DT2's START_DATE and END_DATEs, but also that DT1's CHECKING_DATE is before DT2's EFFECTIVE_DATE, etc.

To summarize

Non-equi joins in data.table behave in a buggy way when:

  1. Some rows are present/absent from one of the tables

AND

  1. More than one non-equi conditions

Update: Reproducible example

set.seed(123)
library(data.table)
library(stringi)

n <- 100000

DT1 <- data.table(RANDOM_STRING = stri_rand_strings(n, 5, pattern = "[a-k]"),
                  DATE = sample(seq(as.Date('2016-01-01'), as.Date('2016-12-31'), by="day"), n, replace=T))

DT2 <- data.table(RANDOM_STRING = stri_rand_strings(n, 5, pattern = "[a-k]"),
                  START_DATE = sample(seq(as.Date('2015-01-01'), as.Date('2017-12-31'), by="day"), n, replace=T))

DT2[, EXPIRY_DATE := START_DATE + floor(runif(1000, 200,300))]

#Initialization
DT1[, MATCHED := FALSE]

#First run
DT1[!(MATCHED), MATCHED := DT2[.SD, on=.(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI ]$N > 0L ]
DT1[, .N, by=MATCHED]

   MATCHED     N
1:   FALSE 85833
2:    TRUE 14167

#Second run
DT1[!(MATCHED), MATCHED := DT2[.SD, on=.(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI ]$N > 0L ]
DT1[, .N, by=MATCHED]

   MATCHED     N
1:   FALSE 73733
2:    TRUE 26267

#And so on with subsequent runs...
Mihail
  • 761
  • 5
  • 22
  • 1
    Uh oh, this is interesting/useful to me, since I use the `.N, by=.EACHI]$N > 0` idiom all the time myself. As you say, when only the relevant rows are present it gives the right result (my repro is here: https://chat.stackoverflow.com/rooms/25312/r-public ). It'd be great if you could come up with a reproducible example. – Frank Jul 20 '17 at 12:17
  • @Frank, see if you can replicate it on some of your own data by re-running the matching statement repeatedly, but each time with a `!(MATCHED)` filter, so the input is different each time, like I described in the post. In the meantime, I'll see if I can come up with a reproducible example. The problem is, I've only seen it occur when there are thousands of rows, and all my data are personal confidential details. – Mihail Jul 20 '17 at 12:31
  • @Mihael what is the order of magnitude of rows/columns you are talking about, when this incident takes place? – amonk Jul 20 '17 at 15:49
  • 1
    @Frank, I've edited my post by adding a reproducible example at the end. – Mihail Jul 20 '17 at 19:12
  • 1
    @amonk, see the reproducible example I've added to the post. – Mihail Jul 20 '17 at 19:12
  • 2
    @Frank: added `set.seed()` and updated the outputs. – Mihail Jul 20 '17 at 19:32
  • Thanks, okay so an example is row `w = 71`: when subsetting we have the right answer `DT1[w, DT2[.SD, on=.(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI ]$N > 0L ]` but if we subset "afterwards", we get the wrong answer `DT1[, DT2[.SD, on=.(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI ]$N > 0L ][w]`. Code for finding this... https://chat.stackoverflow.com/transcript/message/38239131#38239131 – Frank Jul 20 '17 at 19:41
  • Filed an issue with the package: https://github.com/Rdatatable/data.table/issues/2275 (forgot to link it earlier) – Frank Jul 21 '17 at 14:08
  • The issue linked in the last comment is fixed and closed on the devel branch of the package now. I guess it will be available on CRAN for 1.10.6 – Frank Nov 08 '17 at 19:20

1 Answers1

1

This is a workaround solution, which is not at all elegant, but appears to give the right result while the bug is not fixed.

First, we need each row in DT1 and DT2 to have a unique id. A row number will do.

DT1[, DT1_ID := 1:nrow(DT1)]
DT2[, DT2_ID := 1:nrow(DT2)]

Then, we do a following right join to find the matches:

M <- DT2[DT1, on=.(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE)]

head(M, 3)

   RANDOM_STRING START_DATE EXPIRY_DATE DT2_ID DT1_ID
1:         diejk 2016-03-30  2016-03-30     NA      1
2:         afjgf 2016-09-14  2016-09-14     NA      2
3:         kehgb 2016-12-11  2016-12-11     NA      3

M has each row from DT1 next to all matches for that row in DT2. When DT2_ID = NA, there was no match. nrow(M) = 100969, indicating that some DT1 rows were matched to >1 DT2 row. (Dates also took on the wrong values.)

Next, we can use an ifelse() statement to label rows in the original DT1 according to whether or not they were matched.

DT1$MATCHED <- ifelse(DT1$DT1_ID %in% M[!is.na(DT2_ID)]$DT1_ID, TRUE, FALSE)

Final result: 13,316 matches of 100,000

DT1[, .N, by=MATCHED]

   MATCHED     N
1:   FALSE 86684
2:    TRUE 13316
Mihail
  • 761
  • 5
  • 22
  • 1
    Looks good, but fyi, you can use `.I` instead of `1:nrow(DT)` and `cond` instead of `ifelse(cond, TRUE, FALSE)`. – Frank Jul 21 '17 at 14:09
  • 1
    A somewhat simpler variation is: `w = DT1[DT2, on=.(RANDOM_STRING, DATE >= START_DATE, DATE <= EXPIRY_DATE), which=TRUE, nomatch=0]; DT1[, m := DT1_ID %in% w ]`. This works because nomatch=0 drops the unmatched/NA cases and which=TRUE returns only row numbers of DT1. – Frank Jul 21 '17 at 14:14