I would like to know how to compare two strings with a function in Python. More specifically, I want to know how similar the two strings are, expressed as a percentage (based on the letters that appear in both strings). Thanks in advance.
-
Possible dup: http://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison – aioobe Mar 18 '12 at 21:13
-
The asker is asking for a module that can do it. I am asking how to do it without different modules – bahaaz Mar 18 '12 at 21:16
-
@bahaaz Why? Is this homework? What have you tried? – Roman Bodnarchuk Mar 18 '12 at 21:20
-
I am working on a simple program (not homework) and for a portion of my program I need to know how to do this. – bahaaz Mar 18 '12 at 21:22
-
Can you describe the algorithm you wish to implement? This is too vague. – David Heffernan Mar 18 '12 at 21:23
-
For instance, I call a function with the name 'compare': compare("hello","yellow"). And the function calculates the percentage of similar letters. – bahaaz Mar 18 '12 at 21:26
-
Start by telling us what the answer should be for `compare('hello', 'yellow')` and why. – Karl Knechtel Mar 18 '12 at 21:57
4 Answers
You might look at difflib for various ways of comparing the strings and getting differences. It looks like `difflib.Differ().compare(string1, string2)` will get you an iterator that produces lines (one per character when you pass plain strings). Lines prefixed with `-` are only in the first string, lines with a blank prefix are in both strings, and lines prefixed with `+` are only in the second string.
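As a rough sketch of that idea, the snippet below sorts the `Differ` output into three buckets by prefix. The helper name `split_diff` is made up for illustration, and it assumes plain strings are passed so that each character is treated as one "line".

```python
import difflib


def split_diff(s1, s2):
    """Group Differ output into (only in s1, in both, only in s2)."""
    only_first, both, only_second = [], [], []
    # Differ.compare is an instance method; each result line starts with a
    # two-character code ("- ", "+ ", "  ", or "? " for hint lines).
    for line in difflib.Differ().compare(s1, s2):
        code, char = line[:2], line[2:]
        if code == "- ":
            only_first.append(char)
        elif code == "+ ":
            only_second.append(char)
        elif code == "  ":
            both.append(char)
        # "? " hint lines are ignored here
    return only_first, both, only_second


only_first, both, only_second = split_diff("hello", "yellow")
print("only in first: ", only_first)
print("in both:       ", both)
print("only in second:", only_second)
```

Note that this reports characters matched in order, so it is a sequence comparison rather than a pure "letters in common" count.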

def pctSame(s1, s2):
    # Make sorted lists of each string's characters
    s1c = sorted(s1)
    s2c = sorted(s2)
    i1 = 0
    i2 = 0
    same = 0
    # "Merge" the sorted lists, counting matches
    while i1 < len(s1c) and i2 < len(s2c):
        if s1c[i1] == s2c[i2]:
            # A match accounts for one char in each string
            same += 2
            i1 += 1
            i2 += 1
        elif s1c[i1] < s2c[i2]:
            i1 += 1
        else:
            i2 += 1
    # Return ratio of matching chars to total chars (0.0 to 1.0);
    # multiply by 100 for a percentage
    return same / float(len(s1c) + len(s2c))
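The same count can also be sketched with `collections.Counter` from the standard library, which may be easier to read than the manual merge. The function name `shared_letter_ratio` and the choice to return 1.0 for two empty strings are assumptions made for this example, not part of the code above.

```python
from collections import Counter


def shared_letter_ratio(s1, s2):
    """Ratio of characters shared by both strings to total characters."""
    total = len(s1) + len(s2)
    if total == 0:
        # Assumption: two empty strings are treated as identical
        return 1.0
    # Counter & Counter keeps the minimum count of each character,
    # i.e. the multiset intersection of the two strings
    shared = sum((Counter(s1) & Counter(s2)).values())
    # Each shared character appears once in each string, hence the factor 2
    return 2.0 * shared / total


print(round(100 * shared_letter_ratio("hello", "yellow"), 1), "% similar")
```

For "hello" and "yellow" the shared letters are e, l, l, o, so the ratio is 8 out of 11 characters.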

String similarity is a metric that depends on what you are measuring. Are you trying to match a mistyped word to the intended word in the dictionary? Comparing DNA or protein sequences? Trying to do document retrieval based on similarity to a search query? Doing fuzzy name matching? For each of these tasks, a different algorithm might be appropriate. If you're really asking a fully general question, you might start by reading about Levenshtein distance.
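If Levenshtein distance turns out to be the right fit, a minimal pure-Python sketch might look like the following. The function names are illustrative, and turning the distance into a 0 to 1 score by dividing by the longer length is just one common convention, assumed here for the example.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the row (indexed by b) as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / float(longest)


print(levenshtein("hello", "yellow"))
print(levenshtein_similarity("hello", "yellow"))
```

Unlike a letter count, this measure is sensitive to character order, so anagrams do not score as identical.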

The `SequenceMatcher` class from `difflib` is almost what you're looking for. It gives a score between 0 and 1, depending on how much the two strings look like each other.
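A short sketch of how that could be wrapped up, using the `compare` name from the question's comments and scaling the score to a percentage (the scaling is my addition):

```python
from difflib import SequenceMatcher


def compare(s1, s2):
    """Similarity of two strings as a percentage, via SequenceMatcher."""
    # The first argument is the "junk" filter; None disables it
    return 100.0 * SequenceMatcher(None, s1, s2).ratio()


print(compare("hello", "yellow"))
```

`ratio()` is computed as 2*M/T, where M is the number of matched characters and T is the combined length of both strings. Matches must appear in the same order in both strings, so this is not exactly "letters that appear in both".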
