I am trying to link an object having a short description to another having another longer description among other candidates, e.g:
short_string = "fo r ba oo"
long_string1 = "foo bar baz"
long_string2 = "soo az bn"
(Here the "long_strings" are not really longer than our short but this is just an example).
My goal would be to check for every long string how much of the shorter string they "cover" for n-grams (for the sake of our example we will take n=2), i.e which proportion of the short string finds one or several matches longer than 2 elements in a given long string.
Our matches should be (but they do not need to be displayed):
# match between short and long_string1
match1 = ['fo', 'ba', 'oo']
# same for long_string2
match2 = ['oo']
and our output would be which proportion of the short string matches inside the long string:
# It would be better if we ignore the spaces
prop_match1 = sum([len(i) for i in match1])/len(short_string)
# 0.6
# same for prop_match2
# outputs 0.2
Right now I am using the SequenceMatcher from the difflib library, but this has several issues:
from the docstrings SequenceMatcher looks for the longest match and then splits the strings to compare each side recursively, i.e if our longest match is right in the middle and there is a "foo" at the beginning on one string and the end of another, it won't be seen, e.g:
"foo barr baz" and "baz barr foo", only the " barr " match will be found since after finding the longest one, SequenceMatcher will compare ["foo", "baz"] on one side and ["baz", "foo"] on the other.
It is really slow ! I need to compare several small strings for each long string and find the best long string in terms of total coverage, and do so for several thousands of candidates. From the docstrings again, the cost of SequenceMatcher is between quadratic and linear, when I am looking at something that would be linear (That's why I had a look at suffixtrees which seem to be efficient for string searching, but I could not see how it would help my problem, nor find a correct implementation in Python3.)
Any help is welcome, either a simple algorithm I could implement myself or a library that could fix my problem !
edit: adding my code with the SequenceMatcher
for long_string in list_long_strings:
# Short formatting of the strings
long = long_string.upper().replace('\n', ' ')
matcher = difflib.SequenceMatcher(lambda x: x == ' ', long, '')
coverage_total = 0
for s in short_string_list:
matcher.set_seq2(s)
for block in matcher.get_matching_blocks():
if block[2] >= 2 : # Only matches with lenght 2 characters or more
coverage_total += block[2]