Comparing 2 strings with similar letters

Question

I would like to know how to compare 2 different strings through a function in Python. More specifically, how similar 2 different strings are, and their similarity as a percentage (the letters that appear in both strings). Thanks in advance.

Possible dup: http://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison — aioobe, Mar 18 '12 at 21:13
The asker is asking for a module that can do it. I am asking how to do it without different modules — bahaaz, Mar 18 '12 at 21:16
I am working on a simple program (not homework) and for a portion of my program I need to know how to do this. — bahaaz, Mar 18 '12 at 21:22
Can you describe the algorithm you wish to implement? This is too vague. — David Heffernan, Mar 18 '12 at 21:23
For instance, I call a function with the name 'compare': compare("hello","yellow"). And the function calculates the percentage of similar letters. — bahaaz, Mar 18 '12 at 21:26
Start by telling us what the answer should be for `compare('hello', 'yellow')` and why. — Karl Knechtel, Mar 18 '12 at 21:57

score 1 · Answer 1 · edited Mar 18 '12 at 23:59

You might look at difflib for various ways of comparing strings and getting their differences. difflib.Differ().compare(string1, string2) returns a generator of delta entries. Differ is designed for sequences of lines, but if you pass it two plain strings it compares them character by character. Entries prefixed with "- " appear only in the first string, entries prefixed with two spaces appear in both strings, and entries prefixed with "+ " appear only in the second string.
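
As a rough illustration of the prefixes described above, here is a minimal sketch that sorts the entries from Differ into three lists. The helper name diff_letters is hypothetical and not part of difflib. Note that this is an order-sensitive diff, so it is not the same thing as counting letters regardless of position.

```python
from difflib import Differ


def diff_letters(s1, s2):
    """Split a character-level diff into (only in s1, in both, only in s2).

    diff_letters is a hypothetical helper written for this example.
    """
    only_first, common, only_second = [], [], []
    # Passing plain strings makes Differ treat each character as one item
    for entry in Differ().compare(s1, s2):
        prefix, item = entry[:2], entry[2:]
        if prefix == "- ":
            only_first.append(item)
        elif prefix == "+ ":
            only_second.append(item)
        elif prefix == "  ":
            common.append(item)
        # Entries starting with "? " are hint lines and are skipped
    return only_first, common, only_second


only_first, common, only_second = diff_letters("hello", "yellow")
print(only_first, common, only_second)
```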

edited by Morten Siebuhr
answered Mar 18 '12 at 21:23 by Pierce

score 1 · Accepted Answer · answered Mar 18 '12 at 21:36

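def pctSame(s1, s2):
    """Return the fraction (0.0 to 1.0) of characters shared by s1 and s2."""
    # Sort the characters so that equal letters line up during the merge
    s1c = sorted(s1)
    s2c = sorted(s2)
    total = len(s1c) + len(s2c)
    if total == 0:
        # Assumption: two empty strings count as identical (avoids dividing by zero)
        return 1.0
    i1 = 0
    i2 = 0
    same = 0
    # "Merge" the sorted lists, counting matches
    while i1 < len(s1c) and i2 < len(s2c):
        if s1c[i1] == s2c[i2]:
            # A match accounts for one character in each string
            same += 2
            i1 += 1
            i2 += 1
        elif s1c[i1] < s2c[i2]:
            i1 += 1
        else:
            i2 += 1
    # Return ratio of # of matching chars to total chars
    return same / float(total)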
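
For comparison, the same measure can probably be written more compactly with collections.Counter, since the merge above is in effect counting a multiset intersection. This is only a sketch: the name shared_letter_ratio is made up for this example, and returning 1.0 for two empty strings is an assumption rather than something the answer specifies.

```python
from collections import Counter


def shared_letter_ratio(s1, s2):
    """Fraction of characters that s1 and s2 have in common, ignoring order.

    shared_letter_ratio is a hypothetical name used only for this sketch.
    """
    total = len(s1) + len(s2)
    if total == 0:
        # Assumption: two empty strings are treated as identical
        return 1.0
    # Counter & Counter keeps the minimum count of each shared letter
    shared = sum((Counter(s1) & Counter(s2)).values())
    # Each shared letter accounts for one character in each string
    return 2.0 * shared / total


print(shared_letter_ratio("hello", "yellow"))
```

For "hello" and "yellow" the shared letters are e, l, l and o, so the result is 8/11, roughly 73%.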

score 0 · Answer 3 · answered Mar 18 '12 at 21:29

String similarity is a metric that depends on what you are measuring. Are you trying to match a mistyped word to the intended word in the dictionary? Comparing DNA or protein sequences? Trying to do document retrieval based on similarity to a search query? Doing fuzzy name matching? For each of these tasks, a different algorithm might be appropriate. If you're really asking a fully general question, you might start by reading about Levenshtein distance.
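
To make the Levenshtein suggestion concrete, here is a minimal sketch of the usual dynamic-programming formulation, using no external modules. The function names levenshtein and similarity are hypothetical, and dividing the distance by the longer string's length to get a 0 to 1 score is one possible normalisation, not the only one.

```python
def levenshtein(s1, s2):
    """Number of single-character insertions, deletions, or substitutions
    needed to turn s1 into s2."""
    # Keep the shorter string in the inner loop so the row stays small
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    # previous[j] holds the distance between the first i-1 chars of s1
    # and the first j chars of s2
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(s1, s2):
    """Assumed normalisation: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(s1, s2) / float(longest)


print(levenshtein("hello", "yellow"), similarity("hello", "yellow"))
```

Unlike a pure letter count, this measure is sensitive to the order of the characters.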

score 0 · Answer 4 · answered Mar 18 '12 at 21:56

The SequenceMatcher class from difflib is almost what you're looking for. Its ratio() method returns a score between 0 and 1, depending on how closely the two strings resemble each other.
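
A short sketch of how this might look for the example from the question. The wrapper name compare is taken from the asker's comment and is not a difflib function. The first argument to SequenceMatcher is the optional junk filter, and None disables it.

```python
from difflib import SequenceMatcher


def compare(s1, s2):
    """Similarity of s1 and s2 as a percentage, based on SequenceMatcher.ratio()."""
    return 100.0 * SequenceMatcher(None, s1, s2).ratio()


print(round(compare("hello", "yellow"), 1))  # 72.7
```

Here the longest matching block is "ello" (4 characters), and ratio() is defined as 2 * matches / total length, giving 8/11.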

Morten Siebuhr

6,068
4
31
43
