7

In the python difflib library, is the SequenceMatcher class behaving unexpectedly, or am I misreading what the supposed behavior is?

Why does the isjunk argument seem to not make any difference in this case?

difflib.SequenceMatcher(None, "AA", "A A").ratio() return 0.8

difflib.SequenceMatcher(lambda x: x in ' ', "AA", "A A").ratio() returns 0.8

My understanding is that if space is omitted, the ratio should be 1.

bluelogic
  • 71
  • 3

2 Answers2

2

This is happening because the ratio function uses total sequences' length while calculating the ratio, but it doesn't filter elements using isjunk. So, as long as the number of matches in the matching blocks results in the same value (with and without isjunk), the ratio measure will be the same.

I assume that sequences are not filtered by isjunk because of performance reasons.

def ratio(self):   
    """Return a measure of the sequences' similarity (float in [0,1]).

    Where T is the total number of elements in both sequences, and
    M is the number of matches, this is 2.0*M / T.
    """

    matches = sum(triple[-1] for triple in self.get_matching_blocks())
    return _calculate_ratio(matches, len(self.a) + len(self.b))

self.a and self.b are the strings (sequences) passed to the SequenceMatcher object ("AA" and "A A" in your example). The isjunk function lambda x: x in ' ' is only used to determine the matching blocks. Your example is quite simple, so the resulting ratio and matching blocks are the same for both calls.

difflib.SequenceMatcher(None, "AA", "A A").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=1), Match(a=2, b=3, size=0)]

difflib.SequenceMatcher(lambda x: x == ' ', "AA", "A A").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=1), Match(a=2, b=3, size=0)]

Same matching blocks, the ratio is: M = 2, T = 6 => ratio = 2.0 * 2 / 6

Now consider the following example:

difflib.SequenceMatcher(None, "AA ", "A A").get_matching_blocks()
[Match(a=1, b=0, size=2), Match(a=3, b=3, size=0)]

difflib.SequenceMatcher(lambda x: x == ' ', "AA ", "A A").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=1), Match(a=3, b=3, size=0)]

Now matching blocks are different, but the ratio will be the same because the number of matches is still equal:

When isjunk is None: M = 2, T = 6 => ratio = 2.0 * 2 / 6

When isjunk is lambda x: x == ' ': M = 1 + 1, T = 6 => ratio = 2.0 * 2 / 6

Finally, a different number of matches:

difflib.SequenceMatcher(None, "AA ", "A A ").get_matching_blocks()
[Match(a=1, b=0, size=2), Match(a=3, b=4, size=0)]

difflib.SequenceMatcher(lambda x: x == ' ', "AA ", "A A ").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=2), Match(a=3, b=4, size=0)]

The number of matches is different

When isjunk is None: M = 2, T = 7 => ratio = 2.0 * 2 / 7

When isjunk is lambda x: x == ' ': M = 1 + 2, T = 6 => ratio = 2.0 * 3 / 7

Jose S
  • 620
  • 3
  • 8
  • 22
  • 5
    How to then match and ignore some simple lines of characters like the one in question's example if isjunk doesn't work that way? More specifically is there a build in function/method to simply ignore some characters when calculating similarity ratio? – appwizcpl Sep 23 '19 at 04:00
  • Agreed, it's a strange property to use when it doesn't "do anything" as such – CutePoison Jun 16 '22 at 11:42
0

You can remove the characters from the string before sequencing it

def withoutJunk(input, chars):
    return input.translate(str.maketrans('', '', chars))

a = withoutJunk('AA', ' ')
b = withoutJunk('A A', ' ')
difflib.SequenceMatcher(None, a, b).ratio()
# -> 1.0
theRemix
  • 2,154
  • 16
  • 16