Expected Behaviour of the algorithm
I have two strings `a` and `b`, with `a` being the shorter string. I would like to find the substring of `b` that has the highest similarity to `a`. The substring has to be of length `len(a)`, or it has to be placed at the end of `b`.
e.g. for the following two strings:
a = "aa"
b = "bbaba"
the possible substrings of `b` would be:
- "bb"
- "ba"
- "ab"
- "ba"
- "a"
- ""
The edit distance is defined as the number of insertions and deletions. Substitutions are not possible (an insertion plus a deletion has to be used instead). The similarity between the two strings is calculated according to the following equation: `norm = 1 - distance / (len(a) + len(substring))`.
So the substrings above would provide the following results:
- "bb" -> 2 DEL + 2 INS -> 1 - 4 / 4 = 0
- "ba" -> 1 DEL + 1 INS -> 1 - 2 / 4 = 0.5
- "ab" -> 1 DEL + 1 INS -> 1 - 2 / 4 = 0.5
- "ba" -> 1 DEL + 1 INS -> 1 - 2 / 4 = 0.5
- "a" -> 1 INS -> 1 - 1 / 3 ≈ 0.67
- "" -> 2 INS -> 1 - 2 / 2 = 0

So the algorithm should return ≈ 0.67 (= 2/3) for the substring "a".
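For reference, here is a minimal brute-force sketch of this metric (my own illustration, not one of the implementations discussed below). It enumerates every window of `len(a)` in `b` plus every shorter suffix of `b`, computes the indel distance with a standard dynamic program, and returns the best normalized similarity:

```python
def indel_distance(a, b):
    # edit distance with insertions and deletions only (no substitutions)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i] + [0] * len(b)
        for j, cb in enumerate(b, 1):
            if ca == cb:
                curr[j] = prev[j - 1]          # characters match, no cost
            else:
                curr[j] = 1 + min(prev[j],     # delete ca
                                  curr[j - 1]) # insert cb
        prev = curr
    return prev[-1]

def partial_similarity(a, b):
    # assumes len(a) <= len(b): candidates are every window of len(a) in b,
    # plus every shorter suffix of b (a substring "placed at the end")
    candidates = [b[i:i + len(a)] for i in range(len(b) - len(a) + 1)]
    candidates += [b[i:] for i in range(len(b) - len(a) + 1, len(b) + 1)]
    best = 0.0
    for sub in candidates:
        dist = indel_distance(a, sub)
        best = max(best, 1 - dist / (len(a) + len(sub)))
    return best

print(partial_similarity("aa", "bbaba"))  # 0.666... as derived above
```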
Different implementations
A similar ratio is implemented by the Python library FuzzyWuzzy in the form of `fuzz.partial_ratio`. It calculates the ratio in two steps (sketched below):
- search for matching subsequences in the longer string using `difflib.SequenceMatcher.get_matching_blocks`
- calculate the ratio for the substrings of `len(shorter_string)` starting at those matching subsequences and return the maximum ratio

This is really slow, so FuzzyWuzzy uses python-Levenshtein for the similarity calculation when it is available. It performs the same calculation based on the Levenshtein distance, which is faster. However, in edge cases the matching_blocks used for the ratio calculation are completely wrong (see issue 16), which makes it unsuitable as a replacement when correctness is relevant.
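A rough sketch of that two-step approach using `difflib` directly (a simplification of FuzzyWuzzy's logic, not its exact code; note that `SequenceMatcher.ratio()` uses Ratcliff/Obershelp matching, so it only approximates the indel-based norm defined above):

```python
from difflib import SequenceMatcher

def partial_ratio_sketch(shorter, longer):
    # step 1: find matching subsequences between the two strings
    blocks = SequenceMatcher(None, shorter, longer).get_matching_blocks()
    best = 0.0
    for block in blocks:
        # step 2: align a window of len(shorter) so it starts where the
        # matching block suggests, then score shorter against that window
        start = max(block.b - block.a, 0)
        window = longer[start : start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return best

print(partial_ratio_sketch("aa", "bbaba"))
```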
Current implementation
I currently use a C++ port of difflib in combination with a fast bit-parallel implementation of the Levenshtein distance with the weights insertion = 1, deletion = 1 and substitution = 2 (a sketch of the underlying idea follows the list). The current implementation can be found here:
- extracting matching_blocks: matching_blocks
- calculating weighted Levenshtein: weighted Levenshtein
- combining them to calculate the end ratio: partial_ratio
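Since substitutions cost 2, this weighted Levenshtein distance coincides with the pure indel distance, which can in turn be derived from the LCS length: `distance = len(a) + len(b) - 2 * LCS(a, b)`. A sketch of the bit-parallel idea in Python (an Allison-Dix-style bit-vector LCS; Python's arbitrary-precision integers stand in for the fixed-width machine words a C++ version would use):

```python
def lcs_length_bitparallel(a, b):
    # bitmask per character: bit i is set iff a[i] == c
    masks = {}
    for i, c in enumerate(a):
        masks[c] = masks.get(c, 0) | (1 << i)
    row = 0  # bit vector encoding the current column of the LCS DP
    for c in b:
        x = row | masks.get(c, 0)
        # bit-vector update: one full pass of the DP column per character of b
        row = x & ~(x - ((row << 1) | 1))
    return bin(row).count("1")

def indel_distance_fast(a, b):
    # a substitution at weight 2 never beats a deletion + insertion (1 + 1),
    # so the weighted Levenshtein distance equals the indel distance
    return len(a) + len(b) - 2 * lcs_length_bitparallel(a, b)

print(indel_distance_fast("aa", "ba"))  # 2, matching the table above
```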
Question
Is there a faster algorithm to calculate this kind of similarity? The requirements are:
- it only uses insertions/deletions (or gives substitutions the weight 2, which has the same effect)
- it allows a gap at the beginning of the longer string
- it allows a gap at the end of the longer string, as long as the remaining substring does not become shorter than the shorter string
- optimally it enforces that the substring has a similar length (when it is not at the end), so that it matches the behaviour of FuzzyWuzzy, but it would be fine if it allowed longer substrings to be matched as well: e.g. for `a = "aaa"` and `b = "aaba"` this would mean that it is allowed to use `aaba` as the optimal substring instead of `aab` (see the example below).
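For illustration, scoring both candidates with the metric defined above (reusing the `indel_distance` helper from the first sketch):

```python
a = "aaa"
for sub in ("aab", "aaba"):
    dist = indel_distance(a, sub)  # "aab" -> 2 (1 DEL + 1 INS), "aaba" -> 1 (1 DEL)
    print(sub, 1 - dist / (len(a) + len(sub)))
# "aab"  -> 1 - 2/6 ≈ 0.67
# "aaba" -> 1 - 1/7 ≈ 0.86, so the longer match would even score higher here
```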