I have to compare two people datasets between each other.
Let's say i have a data frame with few columns a =
.
ID | Name | Gender | Country
——————————————————————————————————————————————————————————
1 | Mattias Adams | M | UK
2 | James Alan | M | Canada
3 | Dana Benton | F | USA
4 | Ella Collins | F | USA
And b =
ID | First_Name | Last_name | Third_name | Whole_name | Gender
————————————————————————————————————————————————————————————————————————————
1 | Gary | Cole | Allan | Gary Allan Cole | M
2 | Dana | Benton | NA | Dana Benton | F
3 | Lena | Jamison | Anne | Lena Anne Jamison | F
4 | Matt | King | NA | Matt King | M
Data frame a
is the bigger one contains around 100,000 rows while b
contains less than 1,000.
The goal is to use the data in b
to find matching records in a
. So that the whole row in a
is returned if there is a match.
I want to try two ways. First to find exact matches from b$"Whole_name"
in a$"Name"
.
Exact matching:
eue_wn <- as.character(b$"Whole_name")
eue_wn_match <- a[which(as.character(a$"Name") %in% eue_wn),]
if (nrow(eue_wn_match) == 0) {
eue_wn_match <- "No matches"
}
Output of eue_wn_matc
in this case would be:
ID | Name | Gender | Country
——————————————————————————————————————————————————————————
3 | Dana Benton | F | USA
Pattern matching:
eup_ln <- paste(as.character(b$"Last_name"), collapse = "|")
eup_fn <- paste(as.character(b$"First_Name"), collapse = "|")
eup_tn <- paste(as.character(b$"Third_name"), collapse = "|")
eup_match <- a[which(grepl(eup_ln, as.character(a$"Name"), ignore.case = TRUE)),] #First filter (last name)
if (nrow(eup_match) == 0) {
eup_match <- "No matches"
}
if (nrow(eup_match) > 0) {
eup_match2 <- eup_match[which(grepl(eup_fn, as.character(eup_match$"Name"), ignore.case = TRUE)),] #Second filter (first name)
if (nrow(eup_match2) == 0 ) {
eup_match2 <- "No matches"
}
}
if (nrow(eup_match2) > 0) {
eup_match3 <- eup_match2[which(grepl(eup_tn, as.character(eup_match2$"Name"), ignore.case = TRUE)),] #Third filter (third_name)
if (nrow(eup_match3) == 0 ) {
eup_match3 <- "No matches"
}
}
So in this process the matching takes place in 3 stages.
First eup_match
is the result of finding the last name. Than it takes that result and looks for second match which is the first name name results eup_match2
shows record that matches both conditions. Finally the last result is taken and is being matched also with the third name eup_match3
in this case the result of all three of them is the same:
ID | Name | Gender | Country
——————————————————————————————————————————————————————————
3 | Dana Benton | F | USA
And that is incorrect. Only eup_match
and eup_match2
should have that output. Since in the first stage we were matching Dana Benton(a)
and Dana(b)
In the next stage the match was Dana Benton(a)
and Benton (b)
. And since she does not have a third name it is impossible to match her with third name.
The problem is in:
eup_tn <- paste(as.character(b$"Third_name"), collapse = "|")
The output off this is :
"Allan|NA|Anne|NA"
Because the NA was converted to character the function was able to find pattern in a and b. In this particular case Dana Benson (a
) and NA (b)
Any idea on how to correct that ?
Another question is related to the output. Is there any way to output both results from a
and b
Example: if we are only matching the a$Name
with b$First_Name
by patterns the result would be
ID | Name | Gender | Country | Match | Match ID
———————————————————————————————————————————————————————————————————————————
1 | Mattias Adams | M | UK | Matt | 4
3 | Dana Benton | F | USA | Dana | 2
So that the first 4 columns are from the data set a
and the last two from b
Columns Match | Match ID
would show based on what were the records in b
matched.
The desired output for the test example given would be:
ID | Name | Gender | Country
——————————————————————————————————————————————————————————
3 | Dana Benton | F | USA
Sorry for the long post. I tried to make it as clear as possible. If anyone would like to recreate this, xlsx files a
and b
as well as the r code can be found here: MyDropbox
If anyone has other suggestions on how to approach this topic is welcome to present them. Thank you for the help.