Say I have two columns of names. Every name in the first column also appears in the second column, but in a random order, and some of them are not perfect matches. For example, one column might contain the name John Smith while the other has John_smith or JonSmith. Is there a fairly simple way in R to perform a "best match"?

Gavin Simpson
JoshDG
  • If you have more columns than just first name and last name (e.g., you have date of birth and address too, or whatever) and want to find rows that may be matches, look at the `RecordLinkage` package. http://cran.r-project.org/web/packages/RecordLinkage/index.html – Richie Cotton Feb 08 '12 at 16:27

1 Answer

Given some data like this:

df <- data.frame(x = c('john doe', 'john smith', 'sally struthers'), y = c('John Smith', 'John_smith', 'JonSmith'))

You can get a long way by normalising both columns with a few gsub calls and tolower, stripping punctuation and spaces and lowercasing everything:

df$y.fix <- gsub('[[:punct:]]', ' ', df$y)
df$y.fix <- gsub(' ', '', df$y.fix)
df$y.fix <- tolower(df$y.fix)
df$x.fix <- tolower(gsub(' ', '', df$x))

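Then agrep, which does approximate (fuzzy) matching, is what you'll want. It returns the indices of the elements of df$y.fix that are close to the pattern: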

> agrep(df$x.fix[2], df$y.fix)
[1] 1 2 3
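Note that agrep only reports which entries fall within its tolerance; it does not rank them. If you want a single "best" candidate per name, one possible approach (a sketch, not part of the answer above) is to compute edit distances with base R's adist and take the minimum in each row. The helper name `normalise` and the column names of `result` are made up for illustration.

```r
# Hypothetical helper: strip punctuation and whitespace, then lowercase.
normalise <- function(s) tolower(gsub("[[:punct:][:space:]]", "", s))

x <- c("john doe", "john smith", "sally struthers")
y <- c("John Smith", "John_smith", "JonSmith")

# adist() returns a matrix of generalized Levenshtein distances:
# one row per element of x, one column per element of y.
d <- adist(normalise(x), normalise(y))

# For each x, the index of the closest y (ties go to the first minimum).
best <- apply(d, 1, which.min)

# Pair each x with its closest y and record how far apart they are.
result <- data.frame(
  x        = x,
  best.y   = y[best],
  distance = d[cbind(seq_along(x), best)]
)
print(result)
```

A large value in the distance column is a hint that the "best" match is probably not a real match at all (here 'john doe' has no genuine counterpart in y), so in practice you would likely want to apply a cutoff before trusting the pairing.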

For more complex or confusing strings, see this post from last week.

Justin