Compare strings for an approximate match

Question

I have two datasets:

data1 is like

  id           name
1  1         toyota
2  2        walmart
3  3 fox ad company

data2 is like

  id                      name
1  1             sales walmart
2  2 fox advertisement company
3  3              metro toyota

In this instance, assume that every name in data1 can be found, at least approximately, within one of the names in data2.

How can I do this match? When a name in data1 matches a name in data2, I want to print the id of the matching data2 row next to the data1 row.

For example:

  id           name data2
1  1         toyota     3
2  2        walmart     1
3  3 fox ad company     2

This is very unclear with the reference to "dataset" above and then "list" below. Plus it's not reproducible. Are these data frames? Lists? What is the desired output? Please use `dput` to supply each set of words — Rich Scriven, Jan 16 '15 at 00:56
@RichardScriven i do not think it is important it is list or others...I can change it list or data.frame if you want. the result I expect is, if we find a match from list1 and list2, just print the id of list2. — rwrwerwer, Jan 16 '15 at 01:11
It's not that *I* want it that way. It's more about following the SO guidelines for asking a question. — Rich Scriven, Jan 16 '15 at 01:13
@rwrwerwer Please read [this](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). You should create a reproducible example. — agstudy, Jan 16 '15 at 01:16
I voted to close (and would rescind if you heeded Richard's advice) based on "Questions seeking debugging help...must include the desired behavior, a specific problem or error and the shortest code necessary to reproduce it in the question itself" — Tyler Rinker, Jan 16 '15 at 01:31
I think this is all a bit harsh in this instance. While not an ideally framed question, it's fairly easily understood. The required output is also specified reasonably enough: "print the id of list1". — thelatemail, Jan 16 '15 at 01:48
@thelatemail the user was asked to conform to expectations. This was not heeded. Not following through with consistent votes to close sends the message that we have expectations but they're optional. We explained the rationale to give the OP feedback and an opportunity to adjust. I'd say this was fair treatment as this is the user's 7th post. — Tyler Rinker, Jan 16 '15 at 02:29
See also: http://stats.stackexchange.com/questions/3425/how-to-quasi-match-two-vectors-of-strings-in-r http://stackoverflow.com/questions/16145064/approximate-string-matching-in-r and http://stackoverflow.com/questions/2231993/merging-two-data-frames-using-fuzzy-approximate-string-matching-in-r which probably would be a duplicate if it didn't use an off-site resource that is 404. — Matthew Lundberg, Jan 16 '15 at 02:49

score 7 · Answer 1 · answered Jan 16 '15 at 01:15

Assuming you have:

one <- c("toyota","walmart","fox ad company")
two <- c("sales walmart","fox advertisement company","metro toyota")

You could extract the match with the minimum string distance, as calculated by adist. This will not be right every time, since the nearest string is not guaranteed to be the intended one, but it will give you a start. See ?adist for how you might change this to weight or count only insertions, deletions or substitutions of characters.

max.col(-adist(one,two))
#[1] 3 1 2
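
To see what that one-liner is doing, here is a rough sketch that spells out the steps. adist returns a matrix with one row per element of one and one column per element of two, and max.col on the negated matrix picks the column holding the smallest distance in each row. One thing to watch: max.col breaks ties at random by default (ties.method = "random"), so a row-wise which.min, which always takes the first minimum, is a more repeatable alternative. The names d, best and counts are only for illustration.

```r
one <- c("toyota", "walmart", "fox ad company")
two <- c("sales walmart", "fox advertisement company", "metro toyota")

# Levenshtein distance matrix: rows follow `one`, columns follow `two`
d <- adist(one, two)
dimnames(d) <- list(one, two)
print(d)

# Column index of the closest element of `two` for each element of `one`;
# which.min() returns the first minimum, so ties resolve the same way each run
best <- apply(d, 1, which.min)
print(unname(best))

# Breakdown of each distance into insertions, deletions and substitutions
counts <- attr(adist(one, two, counts = TRUE), "counts")
print(counts[1, 3, ])  # edits needed to turn "toyota" into "metro toyota"
```

Printing d also makes it easy to check how far apart the best and second-best candidates are before trusting a match.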

Matches up okay:

data.frame(one, two=two[max.col(-adist(one,two))])
#             one                       two
#1         toyota              metro toyota
#2        walmart             sales walmart
#3 fox ad company fox advertisement company
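
If the inputs are data frames with id and name columns, as the question suggests, the same idea can be used to add the matched data2 id to data1. This is only a sketch: the data frames are rebuilt here from the tables in the question, and max_dist is a hypothetical cutoff (not part of adist) that marks rows as unmatched when even the closest name is too far away.

```r
data1 <- data.frame(id = 1:3,
                    name = c("toyota", "walmart", "fox ad company"),
                    stringsAsFactors = FALSE)
data2 <- data.frame(id = 1:3,
                    name = c("sales walmart", "fox advertisement company",
                             "metro toyota"),
                    stringsAsFactors = FALSE)

# Distance from every data1 name (rows) to every data2 name (columns)
d <- adist(data1$name, data2$name)

# Closest data2 row for each data1 row, and the distance of that match
best <- apply(d, 1, which.min)
best_dist <- d[cbind(seq_len(nrow(d)), best)]

# Hypothetical cutoff (an assumption, tune it for real data): anything
# further away than this is recorded as NA instead of an id
max_dist <- 12
data1$data2 <- ifelse(best_dist <= max_dist, data2$id[best], NA)
print(data1)
```

For the sample data this gives the data2 column 3, 1, 2 shown in the question. Raw edit distances grow with string length, so a fixed cutoff like this is crude; dividing by the length of the longer string is one possible refinement.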
