Python sequence matcher with custom matching function

Question

I have two lists and I want to find the matching elements using python difflib/sequence matcher, and it goes like this:

from difflib import SequenceMatcher
def match_seq(list1,list2):
    output=[]
    s = SequenceMatcher(None, list1, list2)
    blocks=s.get_matching_blocks()
    for bl in blocks:
        #print(bl, bl.a, bl.b, bl.size)
        for bi in range(bl.size):
            cur_a=bl.a+bi
            cur_b=bl.b+bi
            output.append((cur_a,cur_b))
    return output

so when I run it on two lists like this

list1=["orange","apple","lemons","grapes"]
list2=["pears", "orange","apple", "lemons", "cherry", "grapes"]
for a,b in match_seq(list1,list2):
    print(a,b, list1[a],list2[b])

I get this output:

(0, 1, 'orange', 'orange')
(1, 2, 'apple', 'apple')
(2, 3, 'lemons', 'lemons')
(3, 5, 'grapes', 'grapes')

but suppose that I do not want to match the identical items only, but rather use a matching function (for example, a function that can match orange to oranges or vice versa, or match the equivalent word in another language).

list3=["orange","apple","lemons","grape"]
list4=["pears", "oranges","apple", "lemon", "cherry", "grapes"]
list5=["peras", "naranjas", "manzana", "limón", "cereza", "uvas"]

Is there any option in difflib/sequence matcher, or any other python built-in library, that can provide this, so that I can match list3 and list 4, and also list3 and list5, the same way I did for list 1 and list2?

In general, can you think of a solution for this? I thought of replacing each word in the target list with the possible equivalents that I want to match, but this can be problematic because I may need to have multiple equivalents for each word, which may disturb the sequence.

score 3 · Answer 1 · answered Mar 12 '19 at 19:28

You have basically three solutions: 1) write your own implementation of diff; 2) hack the difflib module; 3) find a workaround.

Your own implementation

In case 1), you could look at this question and read a few books like CLRS or the books of Robert Sedgewick.

Hack the `difflib` module

In case 2), have a look at the source code: get_matching_blocks calls find_longest_match at line 479. In the core of find_longest_match, you have the b2j dictionary that maps elements of the list a to their indexes in the list b. If you overwrite this dictionary, you can achieve what you want. Here's the standard version:

>>> import difflib
>>> from difflib import SequenceMatcher
>>> list3 = ["orange","apple","lemons","grape"]
>>> list4 = ["pears", "oranges","apple", "lemon", "cherry", "grapes"]
>>> s = SequenceMatcher(None, list3, list4)
>>> s.get_matching_blocks()
[Match(a=1, b=2, size=1), Match(a=4, b=6, size=0)]
>>> [(b.a+i, b.b+i, list3[b.a+i], list4[b.b+i]) for b in s.get_matching_blocks() for i in range(b.size)]
[(1, 2, 'apple', 'apple')]

Here's the hacked version:

>>> s = SequenceMatcher(None, list3, list4)
>>> s.b2j
{'pears': [0], 'oranges': [1], 'apple': [2], 'lemon': [3], 'cherry': [4], 'grapes': [5]}
>>> s.b2j = {**s.b2j, 'orange':s.b2j['oranges'], 'lemons':s.b2j['lemon'], 'grape':s.b2j['grapes']}
>>> s.b2j
{'pears': [0], 'oranges': [1], 'apple': [2], 'lemon': [3], 'cherry': [4], 'grapes': [5], 'orange': [1], 'lemons': [3], 'grape': [5]}
>>> s.get_matching_blocks()
[Match(a=0, b=1, size=3), Match(a=3, b=5, size=1), Match(a=4, b=6, size=0)]
>>> [(b.a+i, b.b+i, list3[b.a+i], list4[b.b+i]) for b in s.get_matching_blocks() for i in range(b.size)]
[(0, 1, 'orange', 'oranges'), (1, 2, 'apple', 'apple'), (2, 3, 'lemons', 'lemon'), (3, 5, 'grape', 'grapes')]

This is not hard to automate, but I wouldn't recommend you that solution, since there is a very simple workaround.

A workaround

The idea is to group words by families:

families = [{"pears", "peras"}, {"orange", "oranges", "naranjas"}, {"apple", "manzana"}, {"lemons", "lemon", "limón"}, {"cherry", "cereza"}, {"grape", "grapes"}]

It's now easy to create a dictionary that takes maps every word of the family to one of those words (let's call it the main word):

>>> d = {w:main for main, *alternatives in map(list, families) for w in alternatives}
>>> d
{'pears': 'peras', 'orange': 'naranjas', 'oranges': 'naranjas', 'manzana': 'apple', 'lemon': 'lemons', 'limón': 'lemons', 'cherry': 'cereza', 'grape': 'grapes'}

Note that main, *alternatives in map(list, families) unpacks the family into a main word (the first of the list) and a list of alternatives using the star operator:

>>> head, *tail = [1,2,3,4,5]
>>> head
1
>>> tail
[2, 3, 4, 5]

You can then convert the lists to use only main words:

>>> list3=["orange","apple","lemons","grape"]
>>> list4=["pears", "oranges","apple", "lemon", "cherry", "grapes"]
>>> list5=["peras", "naranjas", "manzana", "limón", "cereza", "uvas"]
>>> [d.get(w, w) for w in list3]
['naranjas', 'apple', 'limón', 'grapes']
>>> [d.get(w, w) for w in list4]
['peras', 'naranjas', 'apple', 'limón', 'cereza', 'grapes']
>>> [d.get(w, w) for w in list5]
['peras', 'naranjas', 'apple', 'limón', 'cereza', 'uvas']

The expression d.get(w, w) will return d[w] if w is a key, else w itself. Hence, the words that belong to a family are converted to the main word of that family and the other words are left untouched.

Those lists are easy to compare with the difflib.

Important: the time complexity of the conversion of the lists is negligible compared to the diff algorithm, hence you should not see the difference.

Full code

As a bonus, the full code:

def match_seq(list1, list2):
    """A generator that yields matches of list1 vs list2"""
    s = SequenceMatcher(None, list1, list2)
    for block in s.get_matching_blocks():
        for i in range(block.size):
            yield block.a + i, block.b + i # you don't need to store the matches, just yields them

def create_convert(*families):
    """Return a converter function that converts a list
    to the same list with only main words"""
    d = {w:main for main, *alternatives in map(list, families) for w in alternatives}
    return lambda L: [d.get(w, w) for w in L]

families = [{"pears", "peras"}, {"orange", "oranges", "naranjas"}, {"apple", "manzana"}, {"lemons", "lemon", "limón"}, {"cherry", "cereza"}, {"grape", "grapes", "uvas"}]
convert = create_convert(*families)

list3=["orange","apple","lemons","grape"]
list4=["pears", "oranges","apple", "lemon", "cherry", "grapes"]
list5=["peras", "naranjas", "manzana", "limón", "cereza", "uvas"]

print ("list3 vs list4")
for a,b in match_seq(convert(list3), convert(list4)):
    print(a,b, list3[a],list4[b])

#  list3 vs list4
# 0 1 orange oranges
# 1 2 apple apple
# 2 3 lemons lemon
# 3 5 grape grapes

print ("list3 vs list5")
for a,b in match_seq(convert(list3), convert(list5)):
    print(a,b, list3[a],list5[b])

# list3 vs list5
# 0 1 orange naranjas
# 1 2 apple manzana
# 2 3 lemons limón
# 3 5 grape uvas

very nice solution, I like both the hacking and the workaround! — hmghaly, Mar 14 '19 at 13:09

cody · Answer 2 · 2019-03-13T01:01:20.593

Here's an approach that utilizes a class that inherits from UserString and overrides __eq__() and __hash__() such that strings deemed to be synonyms evaluate as equal:

import collections
from difflib import SequenceMatcher


class SynonymString(collections.UserString):
    def __init__(self, seq, synonyms, inverse_synonyms):
        super().__init__(seq)

        self.synonyms = synonyms
        self.inverse_synonyms = inverse_synonyms

    def __eq__(self, other):
        if self.synonyms.get(other) and self.data in self.synonyms.get(other):
            return True
        return self.data == other

    def __hash__(self):
        if str(self.data) in self.inverse_synonyms:
            return hash(self.inverse_synonyms[self.data])
        return hash(self.data)


def match_seq_syn(list1, list2, synonyms):

    inverse_synonyms = {
        string: key for key, value in synonyms.items() for string in value
    }

    list1 = [SynonymString(s, synonyms, inverse_synonyms) for s in list1]
    list2 = [SynonymString(s, synonyms, inverse_synonyms) for s in list2]

    output = []
    s = SequenceMatcher(None, list1, list2)
    blocks = s.get_matching_blocks()

    for bl in blocks:
        for bi in range(bl.size):
            cur_a = bl.a + bi
            cur_b = bl.b + bi
            output.append((cur_a, cur_b))
    return output


list3 = ["orange", "apple", "lemons", "grape"]
list5 = ["peras", "naranjas", "manzana", "limón", "cereza", "uvas"]

synonyms = {
    "orange": ["oranges", "naranjas"],
    "apple": ["manzana"],
    "pears": ["peras"],
    "lemon": ["lemons", "limón"],
    "cherry": ["cereza"],
    "grape": ["grapes", "uvas"],
}

for a, b in match_seq_syn(list3, list5, synonyms):
    print(a, b, list3[a], list5[b])

Result (comparing lists 3 and 5):

0 1 orange naranjas
1 2 apple manzana
2 3 lemons limón
3 5 grape uvas

thank you so much, that's a smart idea of using userstring and ovverriding the eq — hmghaly, Mar 14 '19 at 13:10

ALFA · Answer 3 · 2019-03-11T16:40:15.640

So let's say you want to fill lists with elements that should be matched to each others. I didn't use any library but Generators. I'm not sure of the efficiency, I tried this code once but I think it should works pretty good.

orange_list = ["orange", "oranges"] # Fill this with orange matching words
pear_list = ["pear", "pears"]
lemon_list = ["lemon", "lemons"]
apple_list = ["apple", "apples"]
grape_list = ["grape", "grapes"]

lists = [orange_list, pear_list, lemon_list, apple_list, grape_list] # Put your matching lists inside this list

def match_seq_bol(list1, list2):
    output=[]
    for x in list1:
        for lst in lists:
            matches = (y for y in list2 if (x in lst and y in lst))
            if matches:
                for i in matches:
                    output.append((list1.index(x), list2.index(i), x,i))
    return output;

list3=["orange","apple","lemons","grape"]
list4=["pears", "oranges","apple", "lemon", "cherry", "grapes"]

print(match_seq_bol(list3, list4))

match_seq_bol() means match sequences based on lists.

The output matching list3 and list4 will be:

[
    (0, 1, 'orange', 'oranges'),
    (1, 2, 'apple', 'apple'),
    (2, 3, 'lemons', 'lemon'),
    (3, 5, 'grape', 'grapes')
]

the number of iterations for this solution will be the size of the first list multiplied by the size of the second list, so I don't think it will be scalable. — hmghaly, Mar 11 '19 at 19:27

Python sequence matcher with custom matching function

3 Answers3

Your own implementation

Hack the difflib module

A workaround

Full code

Hack the `difflib` module