Comparing similarity between multiple strings with a random starting point

Question

I have a bunch of people's names that are tied to their respective identifying numbers (e.g. Social Security Number/National ID/Passport Number). Due to duplication, though, one identity number can have up to 100 names, which could be similar or totally different. E.g. ID 221 could have the names Richard Parker, Mary Parker, Aunt May, Parker Richard, M@rrrrryy Richard, etc. Some are typos but some are totally different names.

Initially, I want to display only 3 (or a similarly small number) of the names that are as different as possible from the rest, so as to alert the viewer that the multiple names might not be typos but could even be a case of identity theft, negligent data capture, or something else!

I've read up on an algorithm to detect similarity and am currently looking at this one, which computes a score: a score of 1 means the two strings are the same, while a lower score means they are dissimilar. In my use case, how can I go through, say, the 100 names and display the 3 that are most dissimilar? The algorithm for that escapes me, as I feel like I need a starting point and then have to compare against all the others, loop again, and so on.

score 2 · Accepted Answer · edited May 23 '17 at 12:02

Take the function from https://stackoverflow.com/a/14631287/1082673 that you mentioned and iterate over all pairwise combinations in your list. This works as long as the list is not too large; the number of pairs grows quadratically (100 names already give 4,950 comparisons), so the computation time can increase quickly.

Here is how to generate the pairs for a given list:

import itertools

persons = ['person1', 'person2', 'person3']

# combinations() yields every unordered pair exactly once
for p1, p2 in itertools.combinations(persons, 2):
    print("Compare", p1, "and", p2)
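To get from the pairs to "the 3 most dissimilar names", one possible approach is to add up each name's similarity to every other name and show the names with the lowest totals. The following is only a sketch under some assumptions: `similarity` and `most_dissimilar` are hypothetical helper names, `difflib.SequenceMatcher` from the standard library stands in for the scoring function from the linked answer (any function returning 1 for identical strings and lower values for dissimilar ones could be swapped in), and the names are assumed to be unique (deduplicate first otherwise).

```python
import itertools
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in score in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def most_dissimilar(names, k=3):
    """Return the k names with the lowest total similarity to the others.

    Assumes the names are unique; duplicates would share one dict entry.
    """
    totals = {name: 0.0 for name in names}
    for n1, n2 in itertools.combinations(names, 2):
        score = similarity(n1, n2)
        totals[n1] += score
        totals[n2] += score
    # Every name is compared with the same number of others, so ranking
    # by total is the same as ranking by average similarity.
    return sorted(totals, key=totals.get)[:k]


names = ['Richard Parker', 'Mary Parker', 'Aunt May',
         'Parker Richard', 'M@rrrrryy Richard']
print(most_dissimilar(names))
```

This needs no starting point, because every name is scored against all the others before anything is ranked. Whether the lowest average is the right notion of "most different" depends on the data, so treat the ranking rule as one choice among several.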

answered Sep 08 '13 at 22:12

tamasgal

Hi. I have an extra question arising from your answer, about sorting a dictionary with itertools. I had not asked it here, so it wouldn't have been fair to edit the question. The new question can be seen [here](http://stackoverflow.com/q/18699961/1082673). Thanks for your assistance. – lukik Sep 09 '13 at 14:01
