
I don't know how to write a program that gives the percentage similarity of two strings of the same length.

For example, for abcd and abce it should give 75%.

The order matters: I don't want it to tell me that abcd and dcab are 100% similar.

I know the Levenshtein module does this, but I want a program that does it.

Michael Petrotta
user3103718
  • possible duplicate of [Good Python modules for fuzzy string comparison?](http://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison) – jscs Dec 22 '13 at 04:28
  • Not a duplicate, please read carefully, in that thread they ask for a module, and I want a program that does it! – user3103718 Dec 22 '13 at 04:31
  • Well, asking for a complete program to be handed to you isn't really acceptable here. You need to demonstrate some understanding of the problem. So I suggested that as a pointer for your research. – jscs Dec 22 '13 at 04:38
  • Ok, I'll have that in mind when I ask again. Thanks :) – user3103718 Dec 22 '13 at 04:40
  • Further, the Levenshtein module _is_ "a program", that you've already said does what you want. The top-voted answer at the question I've linked shows the function call you need to make. – jscs Dec 22 '13 at 04:48

5 Answers

>>> from difflib import SequenceMatcher
>>> SequenceMatcher(None, 'abcd', 'abce').ratio()
0.75

Read the docs for more. The docs describe the algorithm in enough detail that you could implement it yourself, but you're going to end up coding some kind of alignment algorithm from scratch.
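A quick check (a sketch, relying on difflib's documented behavior) that this also satisfies the ordering requirement from the question: `difflib` scores by matched blocks, so reordered strings do not score 100%.

```python
from difflib import SequenceMatcher

# ratio() is 2*M/T, where M is the total number of characters in the
# matching blocks difflib finds, and T the combined length of both strings.
print(SequenceMatcher(None, 'abcd', 'abce').ratio())  # 0.75
# For 'abcd' vs 'dcab' the longest common block is 'ab' (M=2, T=8):
print(SequenceMatcher(None, 'abcd', 'dcab').ratio())  # 0.5, not 1.0
```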

Tim Peters
  • I believe by looking at his examples, he is actually looking for the `longest common substring` – Abhijit Dec 22 '13 at 04:42
  • @Abhijit, yes, you said that in your answer ;-) But I don't believe it. The OP implied that the Levenshtein module does what he wants, but Levenshtein isn't longest common anything - it's "minimal edit distance", which isn't the same thing. The `difflib` algorithm is akin to a longest common substring applied recursively. – Tim Peters Dec 22 '13 at 04:45
  • But then, `SequenceMatcher(None, 'abcd', 'abce').ratio() = 0.25` and not 0 as shown in his example. So the question is confusing, and the OP is not aware of what he actually wants :-) – Abhijit Dec 22 '13 at 04:50
  • I agree the question is unclear. As my answer says, `SequenceMatcher(None, 'abcd', 'abce').ratio()` returns `0.75` (not `0.25`), and the OP said they wanted `75%`. I don't see anything the OP said about any example returning `0`. The only other I see is that they **don't** want `abcd` and `dcab` to return 100%. – Tim Peters Dec 22 '13 at 05:04

The word similar can mean different things in different contexts, but looking at your examples, I am fairly sure you are looking for

Match% = 2* Longest_Common_Substring(a, b) / (len(a) + len(b)) * 100

Just google for Longest Common Substring and you are sure to find plenty of Python implementations.

One such implementation, from the Wikibook Algorithm Implementation/Strings/Longest common substring, is as follows:

def longest_common_substring(s1, s2):
    # m[x][y] holds the length of the common suffix of s1[:x] and s2[:y].
    m = [[0] * (1 + len(s2)) for i in range(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
            else:
                m[x][y] = 0
    return s1[x_longest - longest: x_longest]

Wrapping it in a similarity function, the result conforms to your expectation:

>>> def similarity(s1, s2):
...     return 2. * len(longest_common_substring(s1, s2)) / (len(s1) + len(s2)) * 100

>>> similarity("abcd","abce")
75.0
>>> similarity("abcd","dcba")
25.0
Abhijit
  • This seems like a good approach, but we can't use this for practical purposes when the string size is huge. It is very slow. Instead we can use SequenceMatcher's longest-match method to find the substring, and use the wrapper around it as you have defined in the similarity() method. – Debasish Kanhar Feb 11 '19 at 11:17
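The comment above suggests delegating the longest-match search to difflib. A sketch of that idea (the function names are mine, and difflib's autojunk heuristic is assumed harmless here, as it only kicks in for long strings):

```python
from difflib import SequenceMatcher

def longest_common_substring_fast(s1, s2):
    # find_longest_match does the same job as the pure-Python DP table,
    # but uses difflib's optimized matching machinery.
    sm = SequenceMatcher(None, s1, s2)
    match = sm.find_longest_match(0, len(s1), 0, len(s2))
    return s1[match.a: match.a + match.size]

def similarity(s1, s2):
    return 2.0 * len(longest_common_substring_fast(s1, s2)) / (len(s1) + len(s2)) * 100

print(similarity("abcd", "abce"))  # 75.0
print(similarity("abcd", "dcba"))  # 25.0
```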

Assuming that s1 and s2 have the same length:

from numpy import mean
mean([c1 == c2 for c1, c2 in zip(s1, s2)])
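The same per-position average needs no numpy at all; a minimal sketch using only builtins (assuming equal-length inputs, as this answer does):

```python
def similarity(s1, s2):
    # Position-by-position comparison; order matters by construction.
    if len(s1) != len(s2):
        raise ValueError("strings must have the same length")
    return 100.0 * sum(a == b for a, b in zip(s1, s2)) / len(s1)

print(similarity("abcd", "abce"))  # 75.0
print(similarity("abcd", "dcab"))  # 0.0 (no position matches)
```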
James King

How about this:

>>> a = list('abce')
>>> b = list('abcd')
>>> ( 100 - (sum(i != j for i, j in zip(a, b)) / float(len(a))) * 100 )
75.0
>>> a = list('abce')
>>> b = list('bdce')
>>> ( 100 - (sum(i != j for i, j in zip(a, b)) / float(len(a))) * 100 )
50.0
>>>
James Sapam

Wikipedia has pseudocode for Levenshtein distance (edit distance). It is a simple algorithm; why not give it a try and ask a specific question if you get stuck?
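For reference, a minimal sketch of that pseudocode, with the distance turned into a percentage. The normalization (1 minus distance over the longer length) is my assumption about what the OP wants, not part of the Wikipedia algorithm:

```python
def levenshtein(s1, s2):
    # Classic dynamic-programming edit distance: insertions, deletions,
    # and substitutions each cost 1. Only the previous row is kept.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity(s1, s2):
    return 100.0 * (1 - levenshtein(s1, s2) / max(len(s1), len(s2)))

print(similarity("abcd", "abce"))  # 75.0
```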

Saher Ahwal