Say I have two columns of names. All names in the first column are in the second column, but in a random order, and some of them are not perfect matches. So one column might have the name John Smith while the second has John_smith or JonSmith. Is there a fairly simple way in R of performing a "best match"?
-
If you have more columns than just first name and last name (e.g., you have date of birth and address too, or whatever) and want to find rows that may be matches, look at the `RecordLinkage` package. http://cran.r-project.org/web/packages/RecordLinkage/index.html – Richie Cotton Feb 08 '12 at 16:27
1 Answer
Given some data like this:
df<-data.frame(x=c('john doe','john smith','sally struthers'),y=c('John Smith','John_smith','JonSmith'))
You can get a long way with a few `gsub`s and `tolower`:
df$y.fix <- gsub('[[:punct:]]', ' ', df$y)
df$y.fix <- gsub(' ', '', df$y.fix)
df$y.fix <- tolower(df$y.fix)
df$x.fix <- tolower(gsub(' ', '', df$x))
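If you end up cleaning several columns, the same steps can be wrapped in a small helper. This is only a sketch: `normalize_name` is a name I made up for illustration, and it folds the punctuation and whitespace removal into a single `gsub` call, which should give the same result as the two-step version above.

```r
# Hypothetical helper (not from any package): strip punctuation and
# whitespace, then lower-case, so trivially different spellings collapse.
normalize_name <- function(s) {
  s <- as.character(s)                      # guard against factor columns
  s <- gsub("[[:punct:][:space:]]", "", s)  # drop "_", ".", spaces, etc.
  tolower(s)
}

normalize_name(c("John Smith", "John_smith", "JonSmith"))
# [1] "johnsmith" "johnsmith" "jonsmith"
```

Note that `JonSmith` still differs by one character after cleaning, which is why a fuzzy matcher is needed on top of this.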
Then `agrep` is what you'll want:
> agrep(df$x.fix[2], df$y.fix)
[1] 1 2 3
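`agrep` returns every index within its tolerance, so here all three rows come back. If you want a single "best" candidate per name, one possible approach (a sketch, not the only way) is to compute the full edit-distance matrix with `utils::adist` and take the column with the smallest distance in each row. The names `clean`, `d`, `best`, and `result` are just illustrative.

```r
x <- c("john doe", "john smith", "sally struthers")
y <- c("John Smith", "John_smith", "JonSmith")

# Same cleanup idea as above: drop punctuation/whitespace, lower-case.
clean <- function(s) tolower(gsub("[[:punct:][:space:]]", "", as.character(s)))

# Generalized Levenshtein distances; rows correspond to x, columns to y.
d <- utils::adist(clean(x), clean(y))

# For each x, the index of the nearest y (ties go to the first one).
best <- apply(d, 1, which.min)

result <- data.frame(
  x        = x,
  best_y   = y[best],
  distance = d[cbind(seq_along(x), best)]
)
result
```

Keep in mind that the nearest string is not necessarily a real match: in this toy data `john doe` has no true counterpart in `y` but will still be assigned one. In practice you would probably want to discard pairs whose `distance` is above some cutoff that you pick by inspecting your own data.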
For more complex or confusing strings, see this post from last week.
-
+1 for using `tolower()` and `gsub()` to strip out stuff that would otherwise be overcounted in Levenshtein distances. – Brandon Bertelsen Feb 08 '12 at 16:23