3

In this toy example (which in reality is big data) I have a data frame with personal records which I need to look matches for in two databases.

df <- structure(list(FORENAME = c("GUY", "JULIANA", "BENN"), SURNAME = c("WEEKS", 
"BAIN", "SARAH"), DOB = c("07/01/1985", "20/06/1967", "04/09/1985"
)), .Names = c("FORENAME", "SURNAME", "DOB"), row.names = c(NA, 
-3L), class = "data.frame")

> df
  FORENAME SURNAME        DOB
1      GUY   WEEKS 07/01/1985
2  JULIANA    BAIN 20/06/1967
3     BENN   SARAH 04/09/1985

Database 1

db1 structure(list(FORENAME = c("GUY", "SARAH", "REBECCA"), SURNAME = c("WEEKS", 
"BENN", "SYMES"), DOB = c("07/01/1985", "04/09/1985", "10/07/1990"
)), row.names = c(NA, -3L), class = "data.frame", .Names = c("FORENAME", 
"SURNAME", "DOB"))

> db1
  FORENAME SURNAME        DOB
1      GUY   WEEKS 07/01/1985
2    SARAH    BENN 04/09/1985
3  REBECCA   SYMES 10/07/1990

Database 2

db2 <- structure(list(FORENAME = c("NAILA", "JOANNE", "JULIANA"), SURNAME = c("KHAN", 
"WHITEHEAD", "BAIN"), DOB = c("06/01/1957", "24/08/1970", "20/06/1967"
)), row.names = c(NA, -3L), class = "data.frame", .Names = c("FORENAME", 
"SURNAME", "DOB"))

> db2
  FORENAME   SURNAME        DOB
1    NAILA      KHAN 06/01/1957
2   JOANNE WHITEHEAD 24/08/1970
3  JULIANA      BAIN 20/06/1967

For the sake of example, I am looking to apply two matching criteria:

  1. All three fields equal
  2. FOREMNAME = SURNAME, SURNAME=FORENAME, DOB=DOB (to capture cases which switched the order of names)

...and look for matches in either database, so something like:

((df1$FORENAME == db1$FORENAME | db2$FORENAME) & 
(df1$SURNAME == db1$SURNAME | db2$SURNAME) &
(df1$DOB == db1$DOB | db2$DOB))
|
((df1$FORENAME == db1$SURNAME | db2$SURNAME) &
(df1$SURNAME ==db1$FORENAME | db2$FORENAME) &
(df1$DOB == db1$DOB | db2$DOB))

I want to keep the original df intact and am looking to create a columnt for the result like so:

  FORENAME SURNAME        DOB RESULT
1      GUY   WEEKS 07/01/1985  MATCH
2  JULIANA    BAIN 20/06/1967  MATCH
3     BENN   SARAH 04/09/1985  MATCH

What would be an elegnt way to do it, assuming the real data is hundreds of thousands of rows long, and will involve matching on c. 8 columns and c. 15 criteria?

As this does not require any fuzzy-matching, an unelegant solution that I can think of would be to do a series of dplyr's inner_joins followed by rbind of everything into one big table and unique(df_matched). Then left_join(df, df_matched, by=c(...list of columns)) in order to create the RESULT column in the original df.

Mihail
  • 761
  • 5
  • 22

0 Answers0