0

I have to compare two people datasets between each other.

Let's say i have a data frame with few columns a =.

    ID  |     Name        |      Gender      |  Country   
   ——————————————————————————————————————————————————————————
    1   | Mattias Adams   |        M         |    UK           
    2   | James Alan      |        M         |    Canada
    3   | Dana Benton     |        F         |    USA
    4   | Ella Collins    |        F         |    USA

And b =

    ID  | First_Name | Last_name  | Third_name |  Whole_name       |  Gender   
————————————————————————————————————————————————————————————————————————————
    1   |    Gary    |  Cole      |    Allan   | Gary Allan Cole   |    M
    2   |    Dana    |  Benton    |    NA      | Dana Benton       |    F
    3   |    Lena    |  Jamison   |    Anne    | Lena Anne Jamison |    F        
    4   |    Matt    |  King      |    NA      | Matt King         |    M

Data frame a is the bigger one contains around 100,000 rows while b contains less than 1,000.

The goal is to use the data in b to find matching records in a. So that the whole row in a is returned if there is a match.

I want to try two ways. First to find exact matches from b$"Whole_name" in a$"Name".

Exact matching:

    eue_wn <- as.character(b$"Whole_name")
    eue_wn_match <- a[which(as.character(a$"Name") %in% eue_wn),]
        if (nrow(eue_wn_match) == 0) {
             eue_wn_match <- "No matches"
        }

Output of eue_wn_matc in this case would be:

    ID  |     Name        |      Gender      |  Country   
   —————————————————————————————————————————————————————————— 
    3   | Dana Benton     |        F         |    USA

Pattern matching:

    eup_ln <- paste(as.character(b$"Last_name"), collapse = "|")
    eup_fn <- paste(as.character(b$"First_Name"), collapse = "|")
    eup_tn <- paste(as.character(b$"Third_name"), collapse = "|")

    eup_match <- a[which(grepl(eup_ln, as.character(a$"Name"), ignore.case = TRUE)),]         #First filter (last name)
    if (nrow(eup_match) == 0) {
        eup_match <- "No matches"
    } 
    if (nrow(eup_match) > 0) {
        eup_match2 <- eup_match[which(grepl(eup_fn, as.character(eup_match$"Name"), ignore.case = TRUE)),]     #Second filter (first name)
          if (nrow(eup_match2) == 0 ) {
            eup_match2 <- "No matches"
          }
    }

    if (nrow(eup_match2) > 0) {
      eup_match3 <- eup_match2[which(grepl(eup_tn, as.character(eup_match2$"Name"), ignore.case = TRUE)),]     #Third filter (third_name)
      if (nrow(eup_match3) == 0 ) {
        eup_match3 <- "No matches"
      }
    }

So in this process the matching takes place in 3 stages. First eup_match is the result of finding the last name. Than it takes that result and looks for second match which is the first name name results eup_match2 shows record that matches both conditions. Finally the last result is taken and is being matched also with the third name eup_match3

in this case the result of all three of them is the same:

    ID  |     Name        |      Gender      |  Country   
   —————————————————————————————————————————————————————————— 
    3   | Dana Benton     |        F         |    USA

And that is incorrect. Only eup_match and eup_match2 should have that output. Since in the first stage we were matching Dana Benton(a) and Dana(b) In the next stage the match was Dana Benton(a) and Benton (b). And since she does not have a third name it is impossible to match her with third name. The problem is in:

eup_tn <- paste(as.character(b$"Third_name"), collapse = "|")

The output off this is :

"Allan|NA|Anne|NA"

Because the NA was converted to character the function was able to find pattern in a and b. In this particular case Dana Benson (a) and NA (b)

Any idea on how to correct that ?

Another question is related to the output. Is there any way to output both results from a and b

Example: if we are only matching the a$Name with b$First_Name by patterns the result would be

ID  |     Name        |      Gender      |  Country   | Match | Match ID
———————————————————————————————————————————————————————————————————————————
1   | Mattias Adams   |        M         |    UK      | Matt  |    4 
3   | Dana Benton     |        F         |    USA     | Dana  |    2

So that the first 4 columns are from the data set a and the last two from b Columns Match | Match ID would show based on what were the records in b matched.

The desired output for the test example given would be:

    ID  |     Name        |      Gender      |  Country   
   —————————————————————————————————————————————————————————— 
    3   | Dana Benton     |        F         |    USA

Sorry for the long post. I tried to make it as clear as possible. If anyone would like to recreate this, xlsx files a and b as well as the r code can be found here: MyDropbox

If anyone has other suggestions on how to approach this topic is welcome to present them. Thank you for the help.

jerry_k
  • 363
  • 1
  • 5
  • 20
  • What is your expected result of this function? – Roman Dec 03 '18 at 20:11
  • I'm sorry if it was not clear enough. Basicly we need to compare some lists of people (in this example df `b`) against our own list of clients (df `a`) . The data in the lists of people we need to compare is not high quality and the most accurate fields are the name fields (`First_name`, `Last_name`, `Third_name` and `Whole_name` ) – jerry_k Dec 04 '18 at 08:42
  • If you don’t always have a Third Name, I would look into `fuzzyjoin::stringdist_inner_join()` on a combination of First_Name & Last_Name, which is an approach that merges the ones that I proposed. – Roman Dec 04 '18 at 09:20
  • And: what would be the **exact expected result** for a perfect match-function for the data that you gave? – Roman Dec 04 '18 at 09:23
  • Thank you for your help and suggestions. I owe you a beer :) . That is right the `Third_name` column has a lot of empty cells. The perfect result for pattern searching would be: if I serach in `a` for people listed in `b` the output would contain data from the person in `a` (i.e. our client) and information based on which record in `b` was he matched (the string that matched and the id number that is in the first column of the same record is enought). – jerry_k Dec 04 '18 at 11:03
  • No problem, that’s what SO is for. I mean: could you post in your original post the exact output that you would expect from the data that you gave? I mean: is it only Dana? Or also, e.g., Matt? – Roman Dec 04 '18 at 11:06
  • 1
    It that case the desired output would be something like first the whole row from `a` and than the match from `b` alongside the id number of the match ... So basically `3 | Dana Benton | F | USA | Dana | 2` – jerry_k Dec 04 '18 at 21:00
  • Great, thank you. This is information that you should append to the original post. I edited my solution and it matches closely what you wanted to achieve. – Roman Dec 04 '18 at 23:22

2 Answers2

1

Approach #1: Exact match

Why not something along the lines of

library(stringr)
library(dplyr)
a <- a %>%
    # Extract first and last names into new variables
    mutate(First_Name = str_extract(Name, "^[A-z]+"),
           Last_Name = str_extract(Name, "[A-z]+$"),)

# Inner Join by first and last name.
# Add a suffix to be able to distinguish the origin of columns.
b %>% inner_join(a, by = c("First_Name", "Last_Name"), suffix = c(".b", ".a")) %>%
    # Select the columns you want to see.
    # Note that only the colums that have an ambiguous name have a suffix.
    select(ID.a, Name, Gender.a, Country, First_Name, Last_Name, ID.b)

Works great if you are looking for exact matches only. If you like, you can also extract the middle name from a string via str_extract(string, "[^A-z]+[A-z]+[^A-z$]").

Result:
  ID.a        Name Gender.a Country First_Name Last_Name ID.b
1    3 Dana Benton        F     USA       Dana    Benton    2

Approach #2: Word distance (Jaro-Winkler)

Expanding from this great post:

library(RecordLinkage)
library(dplyr)

lookup <- expand.grid(target = a$Name, source = b$Whole_Name, stringsAsFactors = FALSE)

lookup %>% group_by(target) %>%
    mutate(match_score = jarowinkler(target, source)) %>%
    summarise(match = match_score[which.max(match_score)], matched_to = ref[which.max(match_score)]) %>%
    inner_join(b, c("matched_to" = "Whole_Name"))

Anything above .8 or .9 should be a good match. Still not perfect. You could try to match first and last name separately, if your data is clean.

Result:
# A tibble: 4 x 8
  target        match matched_to         ID First_Name Last_Name Third_Name Gender
  <chr>         <dbl> <chr>           <dbl> <chr>      <chr>     <chr>      <chr> 
1 Dana Benton   1     Dana Benton         2 Dana       Benton    NA         F     
2 Ella Collins  0.593 Matt King           4 Matt       King      NA         M     
3 James Alan    0.667 Gary Allan Cole     1 Gary       Cole      Allan      M     
4 Mattias Adams 0.792 Matt King           4 Matt       King      NA         M     


Approach #3: Word distance (Levenshtein)

Same as above, just using the Levenshtein distance and which.min()

library(RecordLinkage)
library(dplyr)


lookup <- expand.grid(target = a$Name, source = b$Whole_Name, stringsAsFactors = FALSE)

lookup %>% group_by(target) %>%
    mutate(match_score = levenshteinDist(target, source)) %>%
    summarise(match = match_score[which.min(match_score)], matched_to = ref[which.min(match_score)]) %>%
    inner_join(b, c("matched_to" = "Whole_Name"))

As expected, this gives a poorer performance than JW.

Result:
# A tibble: 4 x 8
  target        match matched_to     ID First_Name Last_Name Third_Name Gender
  <chr>         <int> <chr>       <dbl> <chr>      <chr>     <chr>      <chr> 
1 Dana Benton       0 Dana Benton     2 Dana       Benton    NA         F     
2 Ella Collins      9 Dana Benton     2 Dana       Benton    NA         F     
3 James Alan        8 Matt King       4 Matt       King      NA         M     
4 Mattias Adams     8 Matt King       4 Matt       King      NA         M     


Data

a <- structure(list(ID = c(1, 2, 3, 4), Name = c("Mattias Adams", "James Alan", "Dana Benton", "Ella Collins"), Gender = c("M", "M", "F", "F"), Country = c("UK", "Canada", "USA", "USA")), .Names = c("ID", "Name", "Gender", "Country"), row.names = c(NA, -4L), class = "data.frame")
b <- structure(list(ID = c(1, 2, 3, 4), First_Name = c("Gary", "Dana", "Lena", "Matt"), Last_name = c("Cole", "Benton", "Jamison", "King"), Third_Name = c("Allan", "NA", "Anne", "NA"), Whole_name = c("Gary Allan Cole", "Dana Benton", "Lena Anne Jamison", "Matt King"), Gender = c("M", "F", "F", "M")), .Names = c("ID", "First_Name", "Last_Name", "Third_Name", "Whole_Name", "Gender"), row.names = c(NA, -4L), class = "data.frame")
Roman
  • 4,744
  • 2
  • 16
  • 58
0

If you want to avoid the false matches to NA, don't include it in the pattern. Use this instead:

eup_tn <- paste(na.omit(as.character(b$"Third_name")), collapse = "|")

As to your second question: that's done using the merge() function in base R, or one of the replacements for it in ?dplyr::join, probably inner_join().

user2554330
  • 37,248
  • 4
  • 43
  • 90