Matching dissimilar strings

Question

Assume two sets of strings:

[ "Mr. Jones", "O'Flaherty", "Bob", "Rob Jenkins" ]
[ "Maxwell O'Flaherty", "Robert Jenkins", "Mrs. Smith" ]

It is obvious that those two sets have Maxwell O'Flaherty and Robert Jenkins in common.

Is there any algorithm that will allow us to do such matching programatically? I am thinking of writing something that will go through each element in an array of strings and try to find any substring that is unique and not contained in any other element in either of the sets and then use that as a kind of hash of each element to match up the two sets.

You should reveal, what names should be treated as the same. As I'm not familiar with english names, it is not obvious for me, that "those two sets have Maxwell O'Flaherty and Robert Jenkins in common". And not obvious for C# compiler. As for you it is not obvious that "Sasha Ivanov" and "Alexandr Petrovich Ivanov" is the same, but not the same as "Alexey Ivanov". — Vovanium, Nov 13 '10 at 20:27
I agree, a computer would have as little chance of matching Sasha and Alexandr as it would have of matching Richard and Dick. The issue is not surnames but simply matching similar strings. — devprog, Nov 14 '10 at 10:15
Probably a duplicate of: [http://stackoverflow.com/questions/83777/are-there-any-fuzzy-search-or-string-similarity-functions-libraries-written-for-c](http://stackoverflow.com/questions/83777/are-there-any-fuzzy-search-or-string-similarity-functions-libraries-written-for-c) — Jakub Konecki, Nov 13 '10 at 14:18
Thanks for pointing me to that answer. I'll try get my head round it and see if my issue is significantly different in any way. — devprog, Nov 13 '10 at 14:23

score 1 · Answer 1 · answered Nov 13 '10 at 14:27

1

You may find the Levenshtein distance useful. If you are doing a lot of this where it is unclear how accurate the information is there are libraries for string disambiguation. (It's not "obvious" that Rob and Robert are identical - indeed the first one could be Robin.

answered Nov 13 '10 at 14:27

peter.murray.rust

37,407
44
153
217

I am looking into the levenshtein distances as per the first answer. I take your point though that they can only tell you how close the strings are and cannot guarantee that the human meaning of Rob is Robert instead of Robin. I am going to have to find a way of comparing distances across all the elements in a set to establish a mean below which something cannot be considered a match. – devprog Nov 13 '10 at 15:38

score 0 · Answer 2 · answered Nov 13 '10 at 14:57

0

If this is real world example and you need exact match on either name or surname then parse all string in second array and create new array with all parsed substrings and store index to original array's elements that the substring is part of:

[ {"Maxwell", 0}, {"O'Flaherty", 0}, {"Robert", 1}, {"Jenkins", 1}, {"Mrs.", 2}, {"Smith", 2} ]

Now you can find exact match, and know to what person it relates.

answered Nov 13 '10 at 14:57

Dialecticus

16,400
7
43
103

The name and surname element is just an example, they could be any strings so I didn't want to rely on identifying single components like names and surnames. Thanks for your answer though. – devprog Nov 13 '10 at 15:35

score 0 · Answer 3 · answered Nov 14 '10 at 17:49

One approach I've used in the past to deal with issues like Robert vs Bob is by making queries to internet sources that can identify the similarities.

For example, I don't know about Wolfram Alpha's automated search policy (though I think they were working on an API at some point), but a search for Robert ( http://www.wolframalpha.com/input/?i=robert ) would identify that it should be matched up with the name "Rob".

Also, this is not at all programmatic, but I found that clever use of Amazon's Mechanical Turk works wonders for this type of problem if your dataset is reasonably constrained in size.

Matching dissimilar strings

3 Answers3