Imperfect String Matching

Question

Say I have two columns of names. All names in the first column are in the second column, but in a random order, and some of them are not perfect matches. So one column might contain the name John Smith while the second contains John_smith or JonSmith. Is there a fairly simple way in R to perform a "best match"?

If you have more columns than just first name and last name (e.g., you have date of birth and address too, or whatever) and want to find rows that may be matches, look at the `RecordLinkage` package. http://cran.r-project.org/web/packages/RecordLinkage/index.html — Richie Cotton, Feb 08 '12 at 16:27

score 10 · Accepted Answer · edited May 23 '17 at 12:16


Given some data like this:

df <- data.frame(x = c('john doe', 'john smith', 'sally struthers'),
                 y = c('John Smith', 'John_smith', 'JonSmith'))

You can get a long way with a few gsubs and tolower:

df$y.fix <- gsub('[[:punct:]]', ' ', df$y)
df$y.fix <- gsub(' ', '', df$y.fix)
df$y.fix <- tolower(df$y.fix)
df$x.fix <- tolower(gsub(' ', '', df$x))

Then agrep is what you'll want:

> agrep(df$x.fix[2], df$y.fix)
[1] 1 2 3
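
Note that agrep returns every index within its tolerance, not a single best match. One way to pick a single candidate per name is sketched below, assuming base R's adist() (generalized Levenshtein distance, from the utils package) is an acceptable similarity measure. The helper normalise() and the variable names are made up for illustration, and ties simply go to the first candidate.

```r
# Same toy data as above, cleaned the same way (lower case, no punctuation or spaces).
x <- c("john doe", "john smith", "sally struthers")
y <- c("John Smith", "John_smith", "JonSmith")

# Hypothetical helper: strip punctuation and whitespace, then lower-case.
normalise <- function(s) tolower(gsub("[[:punct:][:space:]]", "", s))

x.fix <- normalise(x)
y.fix <- normalise(y)

# adist() returns a matrix of edit distances:
# one row per element of x.fix, one column per element of y.fix.
d <- adist(x.fix, y.fix)

# For each x, take the column with the smallest distance (first one on ties).
best <- apply(d, 1, which.min)

result <- data.frame(
  x        = x,
  best.y   = y[best],
  distance = d[cbind(seq_along(x), best)]
)
print(result)
```

Every x gets a nearest y this way, even when nothing is really close (as with 'john doe' here), so in practice you would probably want to discard matches whose distance exceeds some cutoff you choose for your data.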

For more complex or confusing strings, see this post from last week.


answered Feb 08 '12 at 15:54

Justin


+1 for using `tolower()` and `gsub()` to strip out stuff that would otherwise be overcounted in Levenshtein distances. – Brandon Bertelsen Feb 08 '12 at 16:23
