4

I am trying to use SequenceMatcher.ratio() to get the similarity of two strings: "86418648" and "86488648":

>>> SequenceMatcher(None,"86418648","86488648").ratio()
0.5

The ratio returned is 0.5, which is much lower than I expected because there is only one character different in the two strings.

It seems that the ratio is calculated based on matching blocks. So I tried to run SequenceMatcher.get_matching_blocks():

>>> SequenceMatcher(None,"86418648","86488648").get_matching_blocks()
[Match(a=4, b=0, size=4), Match(a=8, b=8, size=0)]

But I expected the result to be:

[Match(a=0, b=0, size=3), Match(a=4, b=4, size=4), Match(a=8, b=8, size=0)]

Can anyone help to explain why it didn't match from the first 3 numbers "864"?

Zero Piraeus
  • 56,143
  • 27
  • 150
  • 160
Jessie
  • 41
  • 1
  • 5

1 Answers1

2

SequenceMatcher.get_matching_blocks() works by repeated application of SequenceMatcher.find_longest_match() to as-yet-unmatched blocks of the two sequences.

Quoting from the docstring for find_longest_match():

Return (i,j,k) such that a[i:i+k] is equal to b[j:j+k], where
    alo <= i <= i+k <= ahi
    blo <= j <= j+k <= bhi
and for all (i',j',k') meeting those conditions,
    k >= k'
    i <= i'
    and if i == i', j <= j'

In other words, of all maximal matching blocks, return one that
starts earliest in a, and of all those maximal matching blocks that
start earliest in a, return the one that starts earliest in b.

In the case of the two sequences a = "86418648" and b = "86488648", the longest block in a matching a block in b is the single 8648 at a[4], and the earliest match for it in b is the first of two such possible matches, at b[0].

Once this match is decided, there are no longer any further matches such that, per the guarantee provided by SequenceMatcher.get_matching_blocks(), "The triples are monotonically increasing in i and in j".

For instance, matching the as-yet-unmatched 864 at a[0] with the as-yet-unmatched 864 at b[4] would require that i decrease as j increases (or vice versa), in violation of the aforementioned guarantee.

Zero Piraeus
  • 56,143
  • 27
  • 150
  • 160
  • thanks for your reply. However, if I am comparing "164079131" and "169079131", it doesn't use the longest block for the first match. The longest block should be "079131". >>>SequenceMatcher(None,"164079131","169079131").get_matching_b‌​locks() >>>[Match(a=0, b=0, size=2), Match(a=3, b=3, size=6), Match(a=9, b=9, size=0)] – Jessie Jan 10 '18 at 22:19
  • @Jessie Sorry, I wasn't entirely clear: the point is that both *i* and *j* are guaranteed to increase together, and this isn't possible after the earliest instances of `8648` have been matched in each string. In your second example, that is possible, as the result shows. – Zero Piraeus Jan 10 '18 at 22:28
  • Note that the first match **found** is not necessarily the first match **returned**. Matches are [sorted](https://github.com/python/cpython/blob/master/Lib/difflib.py#L489) after they're found but before they're returned. The code is actually quite clear and well-documented; taking a look through it ought to help you understand what's happening. – Zero Piraeus Jan 10 '18 at 22:32