In this toy example (which in reality is big data) I have a data frame with personal records which I need to look matches for in two databases.
df <- structure(list(FORENAME = c("GUY", "JULIANA", "BENN"), SURNAME = c("WEEKS",
"BAIN", "SARAH"), DOB = c("07/01/1985", "20/06/1967", "04/09/1985"
)), .Names = c("FORENAME", "SURNAME", "DOB"), row.names = c(NA,
-3L), class = "data.frame")
> df
FORENAME SURNAME DOB
1 GUY WEEKS 07/01/1985
2 JULIANA BAIN 20/06/1967
3 BENN SARAH 04/09/1985
Database 1
db1 structure(list(FORENAME = c("GUY", "SARAH", "REBECCA"), SURNAME = c("WEEKS",
"BENN", "SYMES"), DOB = c("07/01/1985", "04/09/1985", "10/07/1990"
)), row.names = c(NA, -3L), class = "data.frame", .Names = c("FORENAME",
"SURNAME", "DOB"))
> db1
FORENAME SURNAME DOB
1 GUY WEEKS 07/01/1985
2 SARAH BENN 04/09/1985
3 REBECCA SYMES 10/07/1990
Database 2
db2 <- structure(list(FORENAME = c("NAILA", "JOANNE", "JULIANA"), SURNAME = c("KHAN",
"WHITEHEAD", "BAIN"), DOB = c("06/01/1957", "24/08/1970", "20/06/1967"
)), row.names = c(NA, -3L), class = "data.frame", .Names = c("FORENAME",
"SURNAME", "DOB"))
> db2
FORENAME SURNAME DOB
1 NAILA KHAN 06/01/1957
2 JOANNE WHITEHEAD 24/08/1970
3 JULIANA BAIN 20/06/1967
For the sake of example, I am looking to apply two matching criteria:
- All three fields equal
- FOREMNAME = SURNAME, SURNAME=FORENAME, DOB=DOB (to capture cases which switched the order of names)
...and look for matches in either database, so something like:
((df1$FORENAME == db1$FORENAME | db2$FORENAME) &
(df1$SURNAME == db1$SURNAME | db2$SURNAME) &
(df1$DOB == db1$DOB | db2$DOB))
|
((df1$FORENAME == db1$SURNAME | db2$SURNAME) &
(df1$SURNAME ==db1$FORENAME | db2$FORENAME) &
(df1$DOB == db1$DOB | db2$DOB))
I want to keep the original df
intact and am looking to create a columnt for the result like so:
FORENAME SURNAME DOB RESULT
1 GUY WEEKS 07/01/1985 MATCH
2 JULIANA BAIN 20/06/1967 MATCH
3 BENN SARAH 04/09/1985 MATCH
What would be an elegnt way to do it, assuming the real data is hundreds of thousands of rows long, and will involve matching on c. 8 columns and c. 15 criteria?
As this does not require any fuzzy-matching, an unelegant solution that I can think of would be to do a series of dplyr
's inner_join
s followed by rbind
of everything into one big table and unique(df_matched)
. Then left_join(df, df_matched, by=c(...list of columns))
in order to create the RESULT
column in the original df
.