check efficiently which part of a string is included in a larger string

Question

I am trying to link an object having a short description to another having another longer description among other candidates, e.g:

 short_string = "fo r ba oo"

 long_string1 = "foo bar baz"
 long_string2 = "soo az bn"

(Here the "long_strings" are not really longer than our short but this is just an example).

My goal would be to check for every long string how much of the shorter string they "cover" for n-grams (for the sake of our example we will take n=2), i.e which proportion of the short string finds one or several matches longer than 2 elements in a given long string.

Our matches should be (but they do not need to be displayed):

 # match between short and long_string1
 match1 = ['fo', 'ba', 'oo']
 # same for long_string2
 match2  = ['oo']

and our output would be which proportion of the short string matches inside the long string:

 # It would be better if we ignore the spaces
 prop_match1 = sum([len(i) for i in match1])/len(short_string) 
 # 0.6

 # same for prop_match2
 # outputs 0.2

Right now I am using the SequenceMatcher from the difflib library, but this has several issues:

from the docstrings SequenceMatcher looks for the longest match and then splits the strings to compare each side recursively, i.e if our longest match is right in the middle and there is a "foo" at the beginning on one string and the end of another, it won't be seen, e.g:

"foo barr baz" and "baz barr foo", only the " barr " match will be found since after finding the longest one, SequenceMatcher will compare ["foo", "baz"] on one side and ["baz", "foo"] on the other.
It is really slow ! I need to compare several small strings for each long string and find the best long string in terms of total coverage, and do so for several thousands of candidates. From the docstrings again, the cost of SequenceMatcher is between quadratic and linear, when I am looking at something that would be linear (That's why I had a look at suffixtrees which seem to be efficient for string searching, but I could not see how it would help my problem, nor find a correct implementation in Python3.)

Any help is welcome, either a simple algorithm I could implement myself or a library that could fix my problem !

edit: adding my code with the SequenceMatcher

        for long_string in list_long_strings:
            # Short formatting of the strings
            long = long_string.upper().replace('\n', ' ')
            matcher = difflib.SequenceMatcher(lambda x: x == ' ', long, '')
            coverage_total = 0
            for s in short_string_list:
                matcher.set_seq2(s)
                for block in matcher.get_matching_blocks():
                    if block[2] >= 2 :  # Only matches with lenght 2 characters or more
                      coverage_total += block[2]

I don't get everything, but `sum([len(i) for i in match1])` could be faster like `sum(map(len,match1))` (no list generated, full C underneath) — Jean-François Fabre, Feb 24 '17 at 10:29
Well, that is not actually the part that is the slowest, but that is a start, thank you ! — LoicM, Feb 24 '17 at 10:35
Out of curiosity, how does this fair? `[s for s in "fo r ba oo".split() if s in set(map(''.join, itertools.combinations('foo bar baz', 2)))]` only it doesn't account for `>2` — Steven Summers, Feb 24 '17 at 11:00
From my quick experiment, this is about equivalent with the %timeit magic (60e-8 for mine v 58e-8 for your solution). Another thing is that yours only look for matches of exact length of 2, but has not the problem of "splitting" I tried to describe above — LoicM, Feb 24 '17 at 11:09
What do you mean by splitting? The first dot point _compare each side recursively_? Or the fact I split short into `['fo', 'r', 'ba', 'oo']` at each space? — Steven Summers, Feb 24 '17 at 11:16
I mean that after finding the longest match, SequenceMatcher will compare each side of the string separated by this match recursively. Updated my example, hope that this is clearer — LoicM, Feb 24 '17 at 11:26

check efficiently which part of a string is included in a larger string

0 Answers0