To compare similarities between your records I am thinking you might be looking for a way to apply a measure of Fuzzy Logic matching to your Name comparison task. AKA: applying a String Distance Function
in performing your Record Linkage task. (Forgive me if you know all this already - but these keywords were a huge help to me in the beginning.)
There is a great package called stringdist
that works very well for these applications, but recordlinkage is probably going to help you get to work aligning your dataframes most quickly.
If you wish to review the most similar values for first and last names on down to the most disparate you could use code like the following:
library(RecordLinkage)
library(dplyr)
id <- c(1:5) # added in to allow joining of data tables & comparison results
firstName <- c("Chris","Doug","Shintaro","Bubbles","Elsa")
lastName <- c("MacDougall","Shapiro","Yamazaki","Murphy","Elizabeth Ray")
pet <- c("Cat","Dog","Cat","Dog","Cat")
Names1 <- data.frame(id, firstName, lastName, pet)
id <- c(1:5) # added in to allow joining of data tables & comparison results
firstName2 <- c("Chris","Doug","Shintaro","Bubbles","Elsa")
lastName2 <- c("MacDougal","Shapiro","Yamazaku","Murphy","Elizabeth")
dwelling <- c("House","House","Apartment","Condo","House")
Names2 <- data.frame(id, firstName2, lastName2, dwelling)
# RecordLinkage function that calculates string distance b/w records in two data frames
Results <- compare.linkage(Names1, Names2, blockfld = 1, strcmp = T, exclude = 4)
Results
# $data1
# firstName lastName pet
# 1 Chris MacDougall Cat
# 2 Doug Shapiro Dog
# 3 Shintaro Yamazaki Cat
# 4 Bubbles Murphy Dog
# 5 Elsa Elizabeth Ray Cat
# $data2
# firstName2 lastName2 dwelling
# 1 Chris MacDougal House
# 2 Doug Shapiro House
# 3 Shintaro Yamazaku Apartment
# 4 Bubbles Murphy Condo
# 5 Elsa Elizabeth House
# $pairs
# id1 id2 id firstName lastName is_match
# 1 1 1 1 1 0.9800000 NA
# 2 2 2 1 1 1.0000000 NA
# 3 3 3 1 1 0.9500000 NA
# 4 4 4 1 1 1.0000000 NA
# 5 5 5 1 1 0.9384615 NA
# $frequencies
# id firstName lastName
# 0.200 0.200 0.125
# $type
# [1] "linkage"
# attr(,"class")
# [1] "RecLinkData"
# Trim $pairs dataframe (seen above) to contain just id's & similarity measures
PairsSelect <-
Results$pairs %>%
select(id = id1, firstNameSim = firstName, lastNameSim = lastName)
# Join original data & string comparison results together
# reorganize data to facilitate review
JoinedResults <-
left_join(Names1, Names2) %>%
left_join(PairsSelect) %>%
select(id, firstNameSim, firstName, firstName2, lastNameSim, lastName, lastName2) %>%
arrange(desc(lastNameSim), desc(firstNameSim), id)
JoinedResults
# id firstNameSim firstName firstName2 lastNameSim lastName lastName2
# 1 2 1 Doug Doug 1.0000000 Shapiro Shapiro
# 2 4 1 Bubbles Bubbles 1.0000000 Murphy Murphy
# 3 1 1 Chris Chris 0.9800000 MacDougall MacDougal
# 4 3 1 Shintaro Shintaro 0.9500000 Yamazaki Yamazaku
# 5 5 1 Elsa Elsa 0.9384615 Elizabeth Ray Elizabeth
# If you want to collect just the perfect matches
PerfectMatches <-
JoinedResults %>%
filter(firstNameSim == 1 & lastNameSim == 1) %>%
select(id, firstName, lastName)
PerfectMatches
# id firstName lastName
# 1 2 Doug Shapiro
# 2 4 Bubbles Murphy
# To collect the matches that are going to need alignment
ImperfectMatches <-
JoinedResults %>%
filter(firstNameSim < 1 | lastNameSim < 1) %>%
mutate(flgFrstNm = 0, flgLstNm = 0)
ImperfectMatches
# id firstNameSim firstName firstName2 lastNameSim lastName lastName2 flgFrstNm flgLstNm
# 1 1 1 Chris Chris 0.9800000 MacDougall MacDougal 0 0
# 2 3 1 Shintaro Shintaro 0.9500000 Yamazaki Yamazaku 0 0
# 3 5 1 Elsa Elsa 0.9384615 Elizabeth Ray Elizabeth 0 0
#
# If you want to enter your column preference in a flag column to facilitate faster rectification...
write.csv(ImperfectMatches, "ImperfectMatches.csv", na = "", row.names = F)
## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
# Flag data externally - save file to new name with '_reviewed' appended to filename
## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
#reload results
FlaggedMatches <- read.csv("ImperfectMatches_reviewed.csv", stringsAsFactors = F)
FlaggedMatches
## Where a 1 is the 1st data set preferred and 0 (or 2 if that is easier for the 'data processor') means the 2nd data set is preferred.
# id firstNameSim firstName firstName2 lastNameSim lastName lastName2 flgFrstNm flgLstNm
# 1 1 1 Chris Chris 0.9800000 MacDougall MacDougal 1 0
# 2 3 1 Shintaro Shintaro 0.9500000 Yamazaki Yamazaku 1 1
# 3 5 1 Elsa Elsa 0.9384615 Elizabeth Ray Elizabeth 1 0
## Executing Assembly of preferred/rectified firstName and lastName columns
ResolvedMatches <-
FlaggedMatches %>%
mutate(rectifiedFirstName = ifelse(flgFrstNm == 1,firstName, firstName2),
rectifiedLastName = ifelse(flgLstNm == 1, lastName, lastName2)) %>%
select(id, starts_with("rectified"))
ResolvedMatches
# id rectifiedFirstName rectifiedLastName
# 1 1 Chris MacDougal
# 2 3 Shintaro Yamazaki
# 3 5 Elsa Elizabeth
The dplyr is quite intutive to follow along with but the compare.linkage()
function could use a little explanation.
The first two arguments are obvious: the two dataframes you are comparing (dataframe1 and dataframe2). [If you want to just compare records inside onedataframe to themselves(to dedupe the record set) then you can use compare.dedup()
instead, and only make reference to one dataframe.
Setting blockfld
to 1 or 2, in this case, will specify that matches must be 100% on First Name or Last Name respectively. Instead, you might want to include your primary/foreign key in your dataset and reference that column in the blckfld argument. Alternatively, if your records aren't actually so equivalently constructed, you can leave this argument out entirely (it defaults to FALSE
) and then all possible combinations [the cross product of your dataframes] will be compared.
strcmp
to TRUE
gets you a string distance distance function applied to the data columns you are comparing; if you leave it false then it just tests exact 1:1 string correspondence.
exclude
is also a nice way to avoid having to construct intermediate dataframes and select only the columns you wish to compare to one another: Excluding 3
simply allows us to drop the Pets and Dwelling comparison from the results.
The results produced from the 4 column, keyed, dataframes in the code above (not the original questions 3 column dataframes) are as below:
# $data1
# firstName lastName pet
# 1 Chris MacDougall Cat
# 2 Doug Shapiro Dog
# 3 Shintaro Yamazaki Cat
# 4 Bubbles Murphy Dog
# 5 Elsa Elizabeth Ray Cat
# $data2
# firstName2 lastName2 dwelling
# 1 Chris MacDougal House
# 2 Doug Shapiro House
# 3 Shintaro Yamazaku Apartment
# 4 Bubbles Murphy Condo
# 5 Elsa Elizabeth House
# $pairs
# id1 id2 id firstName lastName is_match
# 1 1 1 1 1 0.9800000 NA
# 2 2 2 1 1 1.0000000 NA
# 3 3 3 1 1 0.9500000 NA
# 4 4 4 1 1 1.0000000 NA
# 5 5 5 1 1 0.9384615 NA
# $frequencies
# id firstName lastName
# 0.200 0.200 0.125
# $type
# [1] "linkage"
# attr(,"class")
# [1] "RecLinkData"
Each of the sections in the above (ex. $pairs) is its own data frame.
Add a key and you can join them all together and then reference and use the values in pairs df as switching level gates to then even copy data1 values over into the data2 frame, say for example when you have a > 0.95 value in the pairing rating.
(Note: is_match looks important, but it is for training matching tools, and is not relevant to our task here.)
In any case, I hope you find the sudden increase in power these libraries will allow you bring to your work as heady as I did when I first encountered them.
BTW: I also found this Comparison of String Distance Algorithms to be a great survey of the string distance metrics currently available.