27

I am a little puzzled by two different answers returned by SequenceMatcher depending on the order of the arguments. Why is it so?

Example

SequenceMatcher is not commutative:

>>> from difflib import SequenceMatcher
>>> SequenceMatcher(None, "Ebojfm Mzpm", "Ebfo ef Mfpo").ratio()
0.6086956521739131

>>> SequenceMatcher(None, "Ebfo ef Mfpo", "Ebojfm Mzpm").ratio()
0.5217391304347826
Martin Thoma
  • 124,992
  • 159
  • 614
  • 958
user2399453
  • 2,930
  • 5
  • 33
  • 60

3 Answers3

22

SequenceMatcher.ratio internally uses SequenceMatcher.get_matching_blocks to calculate the ratio, I will walk you through the steps to see how that happens:

SequenceMatcher.get_matching_blocks

Return list of triples describing matching subsequences. Each triple is of the form (i, j, n), and means that a[i:i+n] == b[j:j+n]. The triples are monotonically increasing in i and j.

The last triple is a dummy, and has the value (len(a), len(b), 0). It is the only triple with n == 0. If (i, j, n) and (i', j', n') are adjacent triples in the list, and the second is not the last triple in the list, then i+n != i' or j+n != j'; in other words, adjacent triples always describe non-adjacent equal blocks.

ratio internally uses SequenceMatcher.get_matching_blocks 's results, and sums the sizes of all matched sequences returned bySequenceMatcher.get_matching_blocks. This is the exact source code from difflib.py:

matches = sum(triple[-1] for triple in self.get_matching_blocks())

The above line is critical, because the result of the above expression is used to compute the ratio. We'll see that shortly and how it impacts the calculation of the ratio.


>>> m1 = SequenceMatcher(None, "Ebojfm Mzpm", "Ebfo ef Mfpo")
>>> m2 = SequenceMatcher(None, "Ebfo ef Mfpo", "Ebojfm Mzpm")

>>> matches1 = sum(triple[-1] for triple in m1.get_matching_blocks())
>>> matches1
7
>>> matches2 = sum(triple[-1] for triple in m2.get_matching_blocks())
>>> matches2
6

As you can see, we have 7 and 6. These are simply the sums of the matched subsequences as returned by get_matching_blocks. Why does this matter? Here's why, the ratio is computed in the following way, (this is from difflib source code):

def _calculate_ratio(matches, length):
    if length:
        return 2.0 * matches / length
    return 1.0

length is len(a) + len(b) where a is the first sequence and b being the second sequence.

Okay, enough talk, we need actions:

>>> length = len("Ebojfm Mzpm") + len("Ebfo ef Mfpo") 
>>> m1.ratio()
0.6086956521739131
>>> (2.0 * matches1 / length)  == m1.ratio()
True

Similarly for m2:

>>> 2.0 * matches2 / length
0.5217391304347826 
>>> (2.0 * matches2 / length) == m2.ratio()
True

Note: Not all SequenceMatcher(None a,b).ratio() == SequenceMatcher(None b,a).ratio() are False, sometimes they can be True:

>>> s1 = SequenceMatcher(None, "abcd", "bcde").ratio()
>>> s2 = SequenceMatcher(None, "bcde", "abcd").ratio()
>>> s1 == s2
True

In case you're wondering why, this is because

sum(triple[-1] for triple in self.get_matching_blocks())

is the same for both SequenceMatcher(None, "abcd", "bcde") and SequenceMatcher(None, "bcde", "abcd") which is 3.

Victor Le Pochat
  • 241
  • 4
  • 11
GIZ
  • 4,409
  • 1
  • 24
  • 43
10

My answer does not provide exact details of the observed problem, but contains a general explanation of why such things may happen with loosely defined diffing methods.

Essentially everything boils down to the fact that, in the general case,

  1. more than one common subsequences of the same length can be extracted from a given pair of strings, and

  2. longer common subsequences may appear less natural to a human expert than a shorter one.

Since you are puzzled by this particular case let's analyze common subsequence identification on the following pair of strings:

  • my stackoverflow mysteries
  • mystery

To me, the natural match is "MYSTER", as follows:

my stackoverflow MYSTERies
.................MYSTERy..

However, the longest match fully covers the shorter of the two strings as follows:

MY STackovERflow mYsteries
MY.ST.....ER......Y.......

The drawback of such a match is that it introduces multiple matching sub-blocks whereas the (shorter) natural match is contiguous.

Therefore, diffing algorithms are tweaked so that their outputs are more pleasing to the final user. As a result, they are not 100% mathematically elegant and therefore don't possess properties that you would expect from a purely academic (rather than practical) tool.

The documentation of SequenceMatcher contains a corresponding note:

class difflib.SequenceMatcher

This is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980’s by Ratcliff and Obershelp under the hyperbolic name “gestalt pattern matching.” The idea is to find the longest contiguous matching subsequence that contains no “junk” elements (the Ratcliff and Obershelp algorithm doesn’t address junk). The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that “look right” to people.

Leon
  • 31,443
  • 4
  • 72
  • 97
  • Could you go a bit more into detail about the heuristics when a shorter contiguous sequence is prefered to a longer non-contigous one? – Martin Thoma Aug 06 '17 at 08:31
  • @MartinThoma The idea behind the approach used in `SequenceMatcher` is very simple and is found in the quote that provided in my answer: *The idea is to find the longest contiguous matching subsequence .... The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence.* – Leon Aug 07 '17 at 08:53
  • Sure, I've seen that. But lets say you have one continous subsequence of length 5 and alternatively you have 2 continous subsequences of length 4 (hence matching a total of 8). Which one would it take? – Martin Thoma Aug 07 '17 at 08:57
  • @MartinThoma I.e. on every level of recursion the algorithm identifies a single (longest) common contiguous subsequence for the two input strings. In my example with **mysteries** it happens to be the substring "**myster**". Then doing the same with the substrings to the left of "**myster**" (i.e. "**my stackoverflow** " and "") doesn't yield any match. No match is found in the substrings to the right of "**myster**" (i.e. "**ies**" and "**y**"), either. – Leon Aug 07 '17 at 08:58
  • @MartinThoma *But lets say you have one continous subsequence of length 5 and alternatively you have 2 continous subsequences of length 4 (hence matching a total of 8). Which one would it take?* On each step, the longest contiguous subsequence wins, hence it will first take the one with length 5. Subsequent steps of recursion may or may not match the other two subsequences of length 4, depending on the actual original strings (i.e. whether those shorter subsequences reside fully to the left/right of the longer subsequence or overlap with it). – Leon Aug 07 '17 at 09:03
-1
from difflib import SequenceMatcher

texto1 = 'BRASILIA~DISTRITO FEDERAL, DF'
texto2 = 'BRASILIA-DISTRITO FEDERAL, '

tamanho_texto1 = len(texto1)
tamanho_texto2 = len(texto2)
tamanho_tot = tamanho_texto1 + tamanho_texto2

tot = 0
if texto1 <= texto2:
    for x in range(len(texto1)):
        y = texto1[x]

        if y in texto2:
            tot += 1
else:
    for x in range(len(texto2)):
        y = texto2[x]

        if y in texto1:
            tot += 1
        
print('sequenceM = ',SequenceMatcher(None, texto1,     texto2).ratio())
print('Total calculado = ',2*tot/tamanho_tot)
  • Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community May 30 '22 at 03:12