0

This question is the culmination of two pieces of code guided by two answers here on SO. The first question I had was how to compare similarity between two strings and I got a good answer as seen here with the following code:

code 1

def get_bigrams(string):
    '''
    Takes a string and returns a list of bigrams
    '''
    s = string.lower()
    return { s[i:i+2] for i in range(len(s) - 1) }

def string_similarity(str1, str2):
    '''
    Perform bigram comparison between two strings
    and return a percentage match in decimal form
    '''
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    intersection = set(pairs1) & set(pairs2)
    return (2.0 * len(intersection)) / (len(pairs1) + len(pairs2))

After that I needed a way to sort the list of names for me to run them through the above code. I got the code here as seen below:

code 2

import itertools
persons = ["Peter parker", "Richard Parker", "Parker Richard", "Aunt May"]
similarity = []
for p1, p2 in itertools.combinations(persons, 2):
    similarity.append(string_similarity(p1,p2))
    print("%s - %s: " %(p1, p2) + " " + str(string_similarity(p1, p2)))

similarity = sorted(similarity, key=float)
print(similarity)

Now, the final hurdle is that my data is not in a list and is actually fetched from a database with primary keys which is what I ultimately want to track. Meaning when I compare multiple names, I need to mark that e.g. ID 1 and ID 2 are the most variant. For me to determine that those two IDs are the most variant, I need to sort the result of 'code1` above which looks like below:

Peter parker - Richard Parker:  0.5454545454545454
Peter parker - Parker Richard:  0.5454545454545454
Peter parker - Aunt May:  0.0
Richard Parker - Parker Richard:  0.8333333333333334
Richard Parker - Aunt May:  0.0
Parker Richard - Aunt May:  0.0
[0.0, 0.0, 0.0, 0.5454545454545454, 0.5454545454545454, 0.8333333333333334]

In my head instead of those names there I need the Primary IDs with which the names were fetched with so am thinking using a dictionary. Is there a way to run a dictionary of {PID:Name}, {PID1:Name1}, PID2:Name2} using code2, get the similarity value using code1, sort the result and then know that names with the highest similarity are PID1 and PID3? Or is there a more elegant and less hair pulling way than am currently thinking...

Community
  • 1
  • 1
lukik
  • 3,919
  • 6
  • 46
  • 89
  • 1
    To "sort" a dictionary in python see **[Python: Sort a dictionary by value](http://stackoverflow.com/questions/613183/python-sort-a-dictionary-by-value)** – MakeCents Sep 09 '13 at 14:28
  • I have gone through several of those "how to sort a dictionary" threads but in this case its different. I need a way to separate the value of my dictionary from the keys, put that value in a list, run the similarity code, get resultant score, sort that score by ascending and the figure out for which IDs that score was for. Better yet, is there another way to carry out this task even without using dictionary but still keep track of the database Primary ID? – lukik Sep 09 '13 at 14:44
  • Using the code form the previous link with a comprehension to slice into a list. Say that `p = {'PID':'Name', 'PID1':'Name1', 'PID2':'Name2'}` then `print [p[pid] for pid in sorted(p, key=lambda x: x in p)]` would create a list of values, or `['Name1', 'Name2', 'Name']`. There is always another way. – MakeCents Sep 09 '13 at 14:59
  • I mean `[p[pid] for pid in sorted(p, key=lambda x: x)]` will create `['Name', 'Name1', 'Name2']`. – MakeCents Sep 09 '13 at 15:36

1 Answers1

1

Yes, you need to associate the pair (ID, name). For this you can use a dict, a tuple or even a class. For example using tuples your code 2 would change to:

persons = [('id1', "Peter parker"), ('id2' ,"Richard Parker"), ('id3' ,"Parker Richard"), ('id4' ,"Aunt May")]
similarity = [[p1, p2, string_similarity(p1[1], p2[1])]
                for p1, p2 in itertools.combinations(persons, 2)]

similarity = sorted(similarity, key=lambda x: x[2], reverse=True)
for p1, p2, sim in similarity:    
    print "{} - {}: {}".format(p1, p2, sim)  # p1[0], p2[0] to show ids only

You get:

('id2', 'Richard Parker') - ('id3', 'Parker Richard'): 0.833333333333
('id1', 'Peter parker') - ('id2', 'Richard Parker'): 0.545454545455
('id1', 'Peter parker') - ('id3', 'Parker Richard'): 0.545454545455
('id1', 'Peter parker') - ('id4', 'Aunt May'): 0.0
('id2', 'Richard Parker') - ('id4', 'Aunt May'): 0.0
('id3', 'Parker Richard') - ('id4', 'Aunt May'): 0.0
elyase
  • 39,479
  • 12
  • 112
  • 119
  • You got it spot on. Thanks for taking the time. I would never have solved it at my current python skill level. – lukik Sep 09 '13 at 18:47