This is a follow-up on this question, where the accepted answer showed an example of a matching exercise using data.table
, including non-equi conditions.
Background
The basic set up is that we have DT1
with a sample of people's details, and DT2
, which is a sort-of master database. And the aim is to find out whether each person in DT1
matches at least one entry in DT2
.
First, we initialize a column that would indicate a match to FALSE
, so that its values could be updated to TRUE
whenever a match is found.
DT1[, MATCHED := FALSE]
The following general solution is then used to update the column:
DT1[, MATCHED := DT2[.SD, on=.(Criteria), .N, by=.EACHI ]$N > 0L ]
In theory, it looks (and should work) fine. The sub-expression DT2[.SD, on=.(Criteria), .N, by=.EACHI]
produces a sub-table with each row from DT1
, and computes the N
column which is the number of matches for that row found in DT2
. Then, whenever N
is greater than zero, the value of MATCHED
in DT1
is updated to TRUE
.
It works as intended in a trivial reproducible example. But I encountered some unexpected behaviour using it with the real data, and cannot get to the bottom of it. I may be missing something or it may be a bug. Unfortunately, I cannot provide a minimal reproducible example, because the data is big, and it only shows in the big data. But I will try to document it as best I can.
Unexpected behaviour or a bug
What helped noticing this is that, for a historic reason, the matches needed to be sought in two separate databases, and hence, the filter !(MATCHED)
was added to the expression to update only those values which have not already been matched:
DT1[!(MATCHED), MATCHED := DT2[.SD, on=.(Criteria), .N, by=.EACHI ]$N > 0L ]
I then noticed that if the line is re-ran several times, with each subsequent run, there will be more and more matches, which were not matched in the preceding runs. (Nothing to do with separate databases, each run matches to DT2).
First run:
MATCHED N
1: FALSE 3248007
2: TRUE 2379514
Second run:
MATCHED N
1: FALSE 2149648
2: TRUE 3477873
To investigate, I then filtered cases which weren't matched on the first run, but were matched on the second. It looks like most cases were false negatives, i.e. those which should have been matched on the first run, but weren't. (But with many runs, eventually there appear also many false positives).
For example, here is one entry from DT1
:
DATE FORENAME SURNAME
1: 2016-01-01 JOHN SMITH
And a matching entry from DT2:
START_DATE EXPIRY_DATE FORENAME SURNAME
1: 2015-09-09 2017-05-01 JOHN SMITH
Running the sub-expression (described above) alone, outside the main expression, to look at the N
numbers, we see that it does not result in a match, when it should (N=0
). (You may also note that START_DATE
and END_DATE
take on the value of DATE
in the output, but that is a whole other issue).
SUB <- DF2[DF1, on=.(FORENAME, SURNAME, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI]
SUB[FORENAME=="JOHN" & "SURNAME=="SMITH"]
FORENAME SURNAME START_DATE EXPIRY_DATE N
1: JOHN SMITH 2016-01-01 2016-01-01 0
However, the buggy behaviour is that the result is affected by what other rows are present in DF1
. For example, suppose I know that JOHN SMITH
's row number in DF1
is 149 and filter DF1
to only that row:
DF2[DF1[149], on=.(FORENAME, SURNAME, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI]
FORENAME SURNAME START_DATE EXPIRY_DATE N
1: JOHN SMITH 2016-01-01 2016-01-01 1
Secondly, I also noticed that the buggy behaviour occurs only with more than one non-equi criterion in the conditions. If the conditions are on=.(FORENAME, SURNAME, START_DATE <= DATE)
, there are no longer any differences between the runs and all rows appear to be matched correctly the first time.
Unfortunately, to solve the real-world problem, I must have several non-equi matching conditions. Not only to ensure that DT1
's DATE
is between DT2
's START_DATE
and END_DATE
s, but also that DT1
's CHECKING_DATE
is before DT2
's EFFECTIVE_DATE
, etc.
To summarize
Non-equi joins in data.table
behave in a buggy way when:
- Some rows are present/absent from one of the tables
AND
- More than one non-equi conditions
Update: Reproducible example
set.seed(123)
library(data.table)
library(stringi)
n <- 100000
DT1 <- data.table(RANDOM_STRING = stri_rand_strings(n, 5, pattern = "[a-k]"),
DATE = sample(seq(as.Date('2016-01-01'), as.Date('2016-12-31'), by="day"), n, replace=T))
DT2 <- data.table(RANDOM_STRING = stri_rand_strings(n, 5, pattern = "[a-k]"),
START_DATE = sample(seq(as.Date('2015-01-01'), as.Date('2017-12-31'), by="day"), n, replace=T))
DT2[, EXPIRY_DATE := START_DATE + floor(runif(1000, 200,300))]
#Initialization
DT1[, MATCHED := FALSE]
#First run
DT1[!(MATCHED), MATCHED := DT2[.SD, on=.(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI ]$N > 0L ]
DT1[, .N, by=MATCHED]
MATCHED N
1: FALSE 85833
2: TRUE 14167
#Second run
DT1[!(MATCHED), MATCHED := DT2[.SD, on=.(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI ]$N > 0L ]
DT1[, .N, by=MATCHED]
MATCHED N
1: FALSE 73733
2: TRUE 26267
#And so on with subsequent runs...