Comparing Columns of Names Between Two Dataframes Before Joining with Dplyr

Question

I'm wondering if there's an easy way to compare columns before doing a join in dplyr. Below are two simple dataframes. I want to do a full join based on first and last names, but there are some spelling mistakes or format differences, such as "Elizabeth Ray" vs "Elizabeth".

I would like to compare these columns before joining. I'm hoping for a way that will produce a list or vector of all the differences with indexes so I can correct them before joining.

If there's an easier way, I'm open to that as well, however I'm hoping for the simplest method. I would like a solution based on dplyr, tidyr, and stringr.

FirstNames<-c("Chris","Doug","Shintaro","Bubbles","Elsa")
LastNames<-c("MacDougall","Shapiro","Yamazaki","Murphy","Elizabeth Ray")
Pets<-c("Cat","Dog","Cat","Dog","Cat")
Names1<-data.frame(FirstNames,LastNames,Pets)

FirstNames2<-c("Chris","Doug","Shintaro","Bubbles","Elsa")
LastNames2<-c("MacDougal","Shapiro","Yamazaku","Murphy","Elizabeth")
Dwelling<-c("House","House","Apartment","Condo","House")
Names2<-data.frame(FirstNames2,LastNames2,Dwelling)

Maybe this: http://stackoverflow.com/questions/2231993/merging-two-data-frames-using-fuzzy-approximate-string-matching-in-r — zx8754, Jun 09 '16 at 21:23
Isn't there a way to use some kind of match function? This would look for any names that don't match between two columns? — Mike, Jun 10 '16 at 17:59

score 0 · Answer 1 · answered Jun 09 '16 at 21:27

I am posting this as an answer because I do not have access to comments.

df = Names1[!(Names1$LastNames %in% Names2$LastNames2), ]

Try the above code.
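The one-liner returns the unmatched rows of Names1, but not their positions, and it only checks one direction. A minimal base R sketch, assuming the question's two dataframes, that lists the row indexes of last names with no exact match on either side (`idx1`, `idx2`, `Mismatches1`, and `Mismatches2` are illustrative names, not from the original post):

```r
# Data from the question (stringsAsFactors = FALSE keeps names as plain strings)
Names1 <- data.frame(
  FirstNames = c("Chris", "Doug", "Shintaro", "Bubbles", "Elsa"),
  LastNames  = c("MacDougall", "Shapiro", "Yamazaki", "Murphy", "Elizabeth Ray"),
  Pets       = c("Cat", "Dog", "Cat", "Dog", "Cat"),
  stringsAsFactors = FALSE
)
Names2 <- data.frame(
  FirstNames2 = c("Chris", "Doug", "Shintaro", "Bubbles", "Elsa"),
  LastNames2  = c("MacDougal", "Shapiro", "Yamazaku", "Murphy", "Elizabeth"),
  Dwelling    = c("House", "House", "Apartment", "Condo", "House"),
  stringsAsFactors = FALSE
)

# Row indexes of last names that have no exact match in the other dataframe
idx1 <- which(!(Names1$LastNames %in% Names2$LastNames2))
idx2 <- which(!(Names2$LastNames2 %in% Names1$LastNames))

# Index plus offending value, one listing per dataframe, ready for review
Mismatches1 <- data.frame(index = idx1, LastNames = Names1$LastNames[idx1])
Mismatches2 <- data.frame(index = idx2, LastNames2 = Names2$LastNames2[idx2])
Mismatches1
Mismatches2
```

If you prefer to stay in dplyr, `anti_join(Names1, Names2, by = c("LastNames" = "LastNames2"))` should return the same unmatched rows of Names1, though without the row indexes.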

Praveen DA


Thanks for this great answer and I will definitely check out the package. But I'm thinking there must be something simpler. Perhaps I wasn't clear enough in my explanation. It's not necessarily spelling mistakes, it's matches. Is there a match function to find words in one column of a dataframe that don't match in another column? That way names spelled incorrectly would be flagged as not matching? – Mike Jun 10 '16 at 17:58

leerssej · Accepted Answer · 2016-06-11T18:04:34.207

To compare similarities between your records, I think you are looking for a way to apply fuzzy matching to your name comparison task, also known as applying a string distance function to a record linkage task. (Forgive me if you know all this already, but these keywords were a huge help to me in the beginning.)

There is a great package called stringdist that works very well for these applications, but RecordLinkage is probably going to get you aligning your dataframes most quickly.

If you wish to review the most similar values for first and last names on down to the most disparate you could use code like the following:

library(RecordLinkage)
library(dplyr)

id <- c(1:5) # added in to allow joining of data tables & comparison results
firstName <- c("Chris","Doug","Shintaro","Bubbles","Elsa")
lastName <- c("MacDougall","Shapiro","Yamazaki","Murphy","Elizabeth Ray")
pet <- c("Cat","Dog","Cat","Dog","Cat")
Names1 <- data.frame(id, firstName, lastName, pet)

id <- c(1:5) # added in to allow joining of data tables & comparison results
firstName2 <- c("Chris","Doug","Shintaro","Bubbles","Elsa")
lastName2 <- c("MacDougal","Shapiro","Yamazaku","Murphy","Elizabeth")
dwelling <- c("House","House","Apartment","Condo","House")
Names2 <- data.frame(id, firstName2, lastName2, dwelling)

# RecordLinkage function that calculates string distance b/w records in two data frames
Results <- compare.linkage(Names1, Names2, blockfld = 1, strcmp = TRUE, exclude = 4)
Results
#  $data1
#    firstName      lastName  pet
# 1      Chris    MacDougall  Cat
# 2       Doug       Shapiro  Dog
# 3   Shintaro      Yamazaki  Cat
# 4    Bubbles        Murphy  Dog
# 5       Elsa Elizabeth Ray  Cat

# $data2
#    firstName2  lastName2  dwelling
# 1       Chris  MacDougal     House
# 2        Doug    Shapiro     House
# 3    Shintaro   Yamazaku Apartment
# 4     Bubbles     Murphy     Condo
# 5        Elsa  Elizabeth     House

# $pairs
# id1 id2 id firstName  lastName is_match
# 1   1   1  1         1 0.9800000       NA
# 2   2   2  1         1 1.0000000       NA
# 3   3   3  1         1 0.9500000       NA
# 4   4   4  1         1 1.0000000       NA
# 5   5   5  1         1 0.9384615       NA

# $frequencies
# id firstName  lastName 
# 0.200     0.200     0.125 
# $type
# [1] "linkage"

# attr(,"class")
# [1] "RecLinkData"

# Trim $pairs dataframe (seen above) to contain just id's & similarity measures
PairsSelect <- 
    Results$pairs %>% 
    select(id = id1, firstNameSim = firstName, lastNameSim = lastName)

# Join original data & string comparison results together
# reorganize data to facilitate review
JoinedResults <-
    left_join(Names1, Names2, by = "id") %>% 
    left_join(PairsSelect, by = "id") %>% 
    select(id, firstNameSim, firstName, firstName2, lastNameSim, lastName, lastName2) %>% 
    arrange(desc(lastNameSim), desc(firstNameSim), id)
JoinedResults
# id firstNameSim firstName firstName2 lastNameSim      lastName lastName2
# 1  2            1      Doug       Doug   1.0000000       Shapiro   Shapiro
# 2  4            1   Bubbles    Bubbles   1.0000000        Murphy    Murphy
# 3  1            1     Chris      Chris   0.9800000    MacDougall MacDougal
# 4  3            1  Shintaro   Shintaro   0.9500000      Yamazaki  Yamazaku
# 5  5            1      Elsa       Elsa   0.9384615 Elizabeth Ray Elizabeth

# If you want to collect just the perfect matches
PerfectMatches <- 
    JoinedResults %>% 
    filter(firstNameSim == 1 & lastNameSim == 1) %>% 
    select(id, firstName, lastName)
PerfectMatches
#   id firstName lastName
# 1  2      Doug  Shapiro
# 2  4   Bubbles   Murphy

# To collect the matches that are going to need alignment
ImperfectMatches <- 
    JoinedResults %>% 
    filter(firstNameSim < 1 | lastNameSim < 1) %>% 
    mutate(flgFrstNm = 0, flgLstNm = 0)
ImperfectMatches
#   id firstNameSim firstName firstName2 lastNameSim      lastName lastName2 flgFrstNm flgLstNm
# 1  1            1     Chris      Chris   0.9800000    MacDougall MacDougal         0        0
# 2  3            1  Shintaro   Shintaro   0.9500000      Yamazaki  Yamazaku         0        0
# 3  5            1      Elsa       Elsa   0.9384615 Elizabeth Ray Elizabeth         0        0
# 

# If you want to enter your column preference in a flag column to facilitate faster rectification...
write.csv(ImperfectMatches, "ImperfectMatches.csv", na = "", row.names = FALSE)
## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
# Flag data externally - save file to new name with '_reviewed' appended to filename
## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
#reload results
FlaggedMatches <- read.csv("ImperfectMatches_reviewed.csv", stringsAsFactors = FALSE)
FlaggedMatches
## A flag of 1 means the value from the 1st data set is preferred; 0 (or 2, if that is easier for the 'data processor') means the 2nd data set is preferred.
#   id firstNameSim firstName firstName2 lastNameSim      lastName lastName2 flgFrstNm flgLstNm
# 1  1            1     Chris      Chris   0.9800000    MacDougall MacDougal         1        0
# 2  3            1  Shintaro   Shintaro   0.9500000      Yamazaki  Yamazaku         1        1
# 3  5            1      Elsa       Elsa   0.9384615 Elizabeth Ray Elizabeth         1        0

## Executing Assembly of preferred/rectified firstName and lastName columns
ResolvedMatches <- 
    FlaggedMatches %>% 
    mutate(rectifiedFirstName = ifelse(flgFrstNm == 1,firstName, firstName2),
           rectifiedLastName = ifelse(flgLstNm == 1, lastName, lastName2)) %>% 
    select(id, starts_with("rectified"))

ResolvedMatches
# id rectifiedFirstName rectifiedLastName
# 1  1              Chris         MacDougal
# 2  3           Shintaro          Yamazaki
# 3  5               Elsa         Elizabeth

The dplyr code is quite intuitive to follow, but the compare.linkage() function could use a little explanation.

The first two arguments are the two dataframes you are comparing (dataframe1 and dataframe2). If you just want to compare the records inside one dataframe to each other (to dedupe the record set), you can use compare.dedup() instead and reference only one dataframe.

blockfld restricts the comparison to record pairs that agree exactly on the given column. With the question's original 3-column dataframes, setting blockfld to 1 or 2 would require a 100% match on first name or last name respectively. Instead, you might want to include your primary/foreign key in your dataset and reference that column in the blockfld argument; that is what blockfld = 1 does in the keyed code above, where column 1 is id. Alternatively, if your records aren't actually so equivalently constructed, you can leave this argument out entirely (it defaults to FALSE) and then all possible combinations [the cross product of your dataframes] will be compared.
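To make the cross-product idea concrete without RecordLinkage, here is a minimal sketch using base R's adist() (generalized Levenshtein distance, so a swapped-in metric rather than the one RecordLinkage uses). It assumes a hypothetical second vector `last2` that is shuffled and missing one name, and picks the closest candidate for every entry of `last1`:

```r
last1 <- c("MacDougall", "Shapiro", "Yamazaki", "Murphy", "Elizabeth Ray")
# Hypothetical second column: different order, one name absent
last2 <- c("Murphy", "Yamazaku", "Elizabeth", "MacDougal")

# Edit distance for every pair: rows follow last1, columns follow last2
d <- adist(last1, last2)

# For each row, the column position of the smallest distance
best <- apply(d, 1, which.min)

Closest <- data.frame(
  last1     = last1,
  bestMatch = last2[best],
  distance  = d[cbind(seq_along(last1), best)],
  stringsAsFactors = FALSE
)
Closest
```

Note that a name with no real counterpart ("Shapiro" here) still receives a "best" match, just with a large distance, so the distance column needs a threshold or a manual review before anything is overwritten.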

Setting strcmp to TRUE applies a string distance function to the data columns you are comparing; if you leave it FALSE, it just tests exact 1:1 string correspondence.
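The difference between the two modes can be sketched in base R. This is only an illustration of exact versus graded comparison: the similarity below is an edit distance scaled to the 0 to 1 range, an assumption made for the example, so its numbers will not equal the scores RecordLinkage reports with its own comparator.

```r
a <- c("MacDougall", "Shapiro", "Yamazaki", "Murphy", "Elizabeth Ray")
b <- c("MacDougal", "Shapiro", "Yamazaku", "Murphy", "Elizabeth")

# Exact comparison: each pair is simply TRUE or FALSE
exact <- a == b

# Graded comparison: pairwise edit distance, scaled so that 1 means identical
editDist   <- diag(adist(a, b))
similarity <- 1 - editDist / pmax(nchar(a), nchar(b))

data.frame(a, b, exact, similarity = round(similarity, 3))
```

With the exact test, "MacDougall" vs "MacDougal" and "Elizabeth Ray" vs "Elizabeth" look equally wrong; the graded score separates a one-letter slip from a missing word.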

exclude is also a nice way to avoid having to construct intermediate dataframes and select only the columns you wish to compare to one another: excluding column 4 in the keyed code above (it would be column 3 in the question's original dataframes) simply drops the pet and dwelling comparison from the results.

The results produced from the 4-column, keyed dataframes in the code above (not the original question's 3-column dataframes) are as below:

#  $data1
#    firstName      lastName  pet
# 1      Chris    MacDougall  Cat
# 2       Doug       Shapiro  Dog
# 3   Shintaro      Yamazaki  Cat
# 4    Bubbles        Murphy  Dog
# 5       Elsa Elizabeth Ray  Cat

# $data2
#    firstName2  lastName2  dwelling
# 1       Chris  MacDougal     House
# 2        Doug    Shapiro     House
# 3    Shintaro   Yamazaku Apartment
# 4     Bubbles     Murphy     Condo
# 5        Elsa  Elizabeth     House

# $pairs
# id1 id2 id firstName  lastName is_match
# 1   1   1  1         1 0.9800000       NA
# 2   2   2  1         1 1.0000000       NA
# 3   3   3  1         1 0.9500000       NA
# 4   4   4  1         1 1.0000000       NA
# 5   5   5  1         1 0.9384615       NA

# $frequencies
# id firstName  lastName 
# 0.200     0.200     0.125 
# $type
# [1] "linkage"

# attr(,"class")
# [1] "RecLinkData"

In the output above, $data1, $data2, and $pairs are each their own data frame ($frequencies and $type are a numeric vector and a string). Add a key and you can join them all together, then use the values in the pairs data frame as gates, for example copying data1 values over into the data2 frame when the pairing score is > 0.95. (Note: is_match looks important, but it is for training matching tools and is not relevant to our task here.)
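A minimal sketch of that gating step, assuming a hypothetical already-joined table `Joined` whose `lastNameSim` values are copied by hand from the $pairs output above (they are not recomputed here), and an illustrative `autoFix` flag column:

```r
library(dplyr)

# Hypothetical joined table: one row per id, similarity copied from $pairs above
Joined <- data.frame(
  id          = 1:5,
  lastName    = c("MacDougall", "Shapiro", "Yamazaki", "Murphy", "Elizabeth Ray"),
  lastName2   = c("MacDougal", "Shapiro", "Yamazaku", "Murphy", "Elizabeth"),
  lastNameSim = c(0.98, 1, 0.95, 1, 0.9384615),
  stringsAsFactors = FALSE
)

Gated <- Joined %>%
  mutate(
    # TRUE where the pair is close enough to fix automatically but not identical
    autoFix   = lastNameSim > 0.95 & lastNameSim < 1,
    # Above the gate, copy the data1 value over the data2 value
    lastName2 = ifelse(lastNameSim > 0.95, lastName, lastName2)
  )
Gated
```

With a strict > 0.95 gate only the MacDougall pair is rewritten; the Yamazaki pair (exactly 0.95) and the Elizabeth Ray pair stay untouched and would go to manual review, so the threshold is worth choosing deliberately.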

In any case, I hope you find the sudden increase in power these libraries bring to your work as heady as I did when I first encountered them.

BTW: I also found this Comparison of String Distance Algorithms to be a great survey of the string distance metrics currently available.

Thanks for this great answer and I will definitely check out the package. But I'm thinking there must be something simpler. Perhaps I wasn't clear enough in my explanation. It's not necessarily spelling mistakes, it's matches. Is there a match function to find words in one column of a dataframe that don't match in another column? That way names spelled incorrectly would be flagged as not matching? — Mike, Jun 10 '16 at 17:57
Yup! That is exactly what you get as a result in the code above. You get a value between 0 and 1 for how much similarity there is in the strings. So if you filter just the >= 0.95 values you will get the rows that differ by only a couple of letters. You could also `arrange(desc())` on that column, and you would get the exact matches at the top, down to the completely different values at the bottom; then you could go through and push the easy ones together (or even `mutate(ifelse(pairsLastNames > 0.95, data2LastNames, data1LastName))` if you join the $data1, $data2, and $pairs dataframes). — leerssej, Jun 10 '16 at 20:52
Happy to code out the whole process if you like. Is the above a good solution or did you have another way that you wanted to change the values. (i.e. Did you prefer, when there is a tight match to overwrite the values of the first dataframe over the second or second over the first, or maybe you'd just want to put a flag next to the Oll Korrect values, and maybe a different flag next to the Mostly Complete values so you can then sort them out later for hand recoding?) — leerssej, Jun 10 '16 at 21:07
Thanks for taking the time to help! It would be great if you could code out the whole process. Depending on how much coding is involved, it would be nice to see how to overwrite the values of the second dataframe over the first, and also how to use flags for hand recoding. I like the use of "mutate" as well in the comment above. I look forward to seeing how it works and I'm excited about learning how to use this package. I had trouble articulating exactly what I wanted because I didn't know this was possible. Thanks again for the help! — Mike, Jun 11 '16 at 02:44
@alistaire's suggestion to use `agrep` / `adist` is also a very effective solution. See for example: http://www.r-bloggers.com/fuzzy-string-matching-a-survival-skill-to-tackle-unstructured-information/ ( these native R methods employ the generalized Levenshtein (edit) distance, which in this instance is more than sufficient, and probably most relevant to your processing approach, in fact.) — leerssej, Jun 11 '16 at 18:19

score 0 · Answer 3 · answered Jun 12 '16 at 02:31

Using the standard adist() function as @alistaire suggested offers a very efficient approach (and probably is more likely what the instructor sought to see employed.) adist's string metrics are restricted to generalized Levenshtein (edit) distance, but this looks to be exactly what you are seeking.

Code is as follows: (Since this looks like an intro to R coding class specifically for data handling I added in some best practices polish to the reproducible/question posed.)

library(dplyr)

id <- c(1:5)
firstName <- c("Chris","Doug","Shintaro","Bubbles","Elsa")
lastName <- c("MacDougall","Shapiro","Yamazaki","Murphy","Elizabeth Ray")
pet <- c("Cat","Dog","Cat","Dog","Cat")
Names1 <- data.frame(id, firstName, lastName, pet)

id <- c(1:5)
firstName2 <- c("Chris","Doug","Shintaro","Bubbles","Elsa")
lastName2 <- c("MacDougal","Shapiro","Yamazaku","Murphy","Elizabeth")
dwelling <- c("House","House","Apartment","Condo","House")
Names2 <- data.frame(id, firstName2, lastName2, dwelling)
# NB: technically you could merge these data frames later with `bind_cols()` but best 
# datahandling practices dictate joining/comparing data based on keys (instead of 
# binding columns together based upon the order in which tables are initially arranged.)
#[also preference is for column headers to be singular and lower case, and tables/dataframes to be uppercase and plural - from (or extension from principles in): https://google.github.io/styleguide/Rguide.xml]

## adist() calculates string distance b/w records in two data frames
# A matrix across all columns is a quick way to ascertain similarity of data
# on an overall column-to-column basis.
# 0 is the closest resemblance; higher numbers mean less resemblance
ResultsInterColumnComparison <-
    adist(Names1, Names2, partial = TRUE)
ResultsInterColumnComparison

# firstName to firstName2 & lastName to LastName2 are similar columns.
#           id firstName2 lastName2 dwelling
# id         0          2         2        2
# firstName 15          0         3        4
# lastName  15          3         0        5
# pet       15          5         4        3

# adist column-to-column difference count (row-wise pairs sit on the diagonal)
dltFrstN <- diag(adist(Names1$firstName, Names2$firstName2, partial = TRUE))
dltLstN <- diag(adist(Names1$lastName, Names2$lastName2, partial = TRUE))

# Join all info together
DFcompilation <- 
    data.frame(id, dltFrstN, firstName, firstName2, dltLstN, lastName, lastName2) %>% 
    arrange(desc(dltLstN), desc(dltFrstN))
DFcompilation
#   id dltFrstN firstName firstName2 dltLstN      lastName lastName2
# 1  5        0      Elsa       Elsa       4 Elizabeth Ray Elizabeth
# 2  1        0     Chris      Chris       1    MacDougall MacDougal
# 3  3        0  Shintaro   Shintaro       1      Yamazaki  Yamazaku
# 4  2        0      Doug       Doug       0       Shapiro   Shapiro
# 5  4        0   Bubbles    Bubbles       0        Murphy    Murphy

This approach is simpler, and much more concise in the coding required. I hope this is more helpful for your purposes, too.
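When the data come from files rather than loose vectors, the same comparison can be built from the dataframes themselves by joining on the key first. A minimal sketch, assuming both tables share an `id` column as in the code above; `Compared` and `Names2Fixed` are illustrative names, and the last step shows one way to overwrite Names2's last names with Names1's wherever the distance is above 0:

```r
library(dplyr)

Names1 <- data.frame(
  id        = 1:5,
  firstName = c("Chris", "Doug", "Shintaro", "Bubbles", "Elsa"),
  lastName  = c("MacDougall", "Shapiro", "Yamazaki", "Murphy", "Elizabeth Ray"),
  pet       = c("Cat", "Dog", "Cat", "Dog", "Cat"),
  stringsAsFactors = FALSE
)
Names2 <- data.frame(
  id         = 1:5,
  firstName2 = c("Chris", "Doug", "Shintaro", "Bubbles", "Elsa"),
  lastName2  = c("MacDougal", "Shapiro", "Yamazaku", "Murphy", "Elizabeth"),
  dwelling   = c("House", "House", "Apartment", "Condo", "House"),
  stringsAsFactors = FALSE
)

# Join on the key first, then compute the distances inside the joined table
Compared <- inner_join(Names1, Names2, by = "id") %>%
  mutate(
    dltFrstN = diag(adist(firstName, firstName2, partial = TRUE)),
    dltLstN  = diag(adist(lastName, lastName2, partial = TRUE))
  ) %>%
  arrange(desc(dltLstN), desc(dltFrstN))

# Overwrite Names2's last name with Names1's wherever the two differ
Names2Fixed <- Compared %>%
  mutate(lastName2 = ifelse(dltLstN > 0, lastName, lastName2)) %>%
  arrange(id) %>%
  select(id, firstName2, lastName2, dwelling)
Names2Fixed
```

Because every column travels through the join, the row counts always agree, which avoids the "difference in the number of rows" error that data.frame() raises when separately selected vectors have different lengths.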

leerssej


Thanks! The adist option does seem simpler. Since in reality I will be working with a dataset uploaded from a csv file, I guess I will have to use dplyr "select" to pull out the "firstName","firstName2","lastName", and "lastName2" columns so that I can create the new dataframe in the last step of the example above? Also, to make things more complicated, how can one dataframe be overwritten by the other? Say for example, if I wanted to replace all the names with a dltLstN higher than 0 in the Names2 dataframe with the names in the Names1 dataframe? – Mike Jun 12 '16 at 04:42
I'm also having some trouble on the last step of the above example - joining everything together. I used Dplyr "select" to create separate objects for "firstName","firstName2","lastName", and "lastName2", as well as ID, but when I try to put them together, there is an error that says there is a "difference in the number of rows" . I'm still relatively new to R, so there is probably something obvious I'm missing. It's also late, so my brain might not be working properly. Any points you can suggest would be greatly appreciated! – Mike Jun 12 '16 at 05:09
I hope I'm not making things too complicated. Basically, my ultimate goal is simply to find all the sets of different spellings, and then to overwrite one column with the other so I can join dataframes. This is the main goal. – Mike Jun 12 '16 at 05:26
Overwriting or making a 'preferred column' can both get you what you seek. The only difference is that the former option has a `mutate()` that starts and ends with the name of one of the constituent columns, while the second method's mutate makes a column with a whole new name. You are still going to have to figure out which value to choose. You can select the column by using a flag like in the RecordLinkage example, or, if you prefer to auto-select the shortest one, your `ifelse()` can hinge on an `nchar()` comparison. – leerssej Jun 13 '16 at 00:48
The main goal should be completely solved by executing the complete solution in the RecordLinkage solution :-D – leerssej Jun 13 '16 at 00:52
For merging dataframes in dplyr, you can use `left_join()` just like so: `JoinedResults <- left_join(Names1, Names2)` but you NEED a key to join on. Think of it like needing to at least have a designated start and finish to put a jacket together correctly. Otherwise, if you just start buttoning hole and buttons where-ever, you will end up with a very interesting result. Pure chance may get you a perfect result sometimes, but when the button holes and buttons start to number above 1, your chances are going to decline rapidly without some way to match buttons to holes correctly. – leerssej Jun 13 '16 at 01:00
Thanks for the continuing help. I realize now that my example is too simplistic. The actual data I'm working with actually contain a different number of names in the firstnames and lastnames columns, and are in different orders, so I realize that this will change the code, especially since it's not possible to simply create an ID column. I also get an "Error: All select() inputs must resolve to integer column positions" when I try to make changes to the Results$Pairs dataframe when using RecordLinkage. I think this is probably because of the differing number of rows between the two dataframes – Mike Jun 13 '16 at 02:17
Since this discussion is becoming long, I'm going to create another question page with new data. I'm going to accept this answer since it helped, and pose the additional questions as a new question. So hopefully you can find my new post and help me there as well...thanks again for all the help! – Mike Jun 13 '16 at 02:18
I moved this topic to this new page "Using RecordLinkage to Match Unequal Names Columns in Two Dataframes Before Joining with Dplyr" – Mike Jun 13 '16 at 04:06
The data in this new page better represents the actual data I'm working with. Not sure how much your answers on this page will have to change if the dataframe name columns are in different orders and contain different numbers of names. – Mike Jun 13 '16 at 04:07
