Expected Behaviour of the algorithm
I have two strings `a` and `b`, with `a` being the shorter string. I would like to find the substring of `b` that has the highest similarity to `a`. The substring has to be of length `len(a)`, or it has to be placed at the end of `b`.
e.g. for the following two strings:
a = "aa"
b = "bbaba"
the possible substrings of `b` would be:
- "bb"
- "ba"
- "ab"
- "ba"
- "a"
- ""
The edit distance is defined as the number of insertions and deletions. Substitutions are not possible (an insertion plus a deletion has to be used instead). The similarity between the two strings is calculated according to the following equation: `norm = 1 - distance / (len(a) + len(substring))`.
So the substrings above would provide the following results:
- "bb" -> 2 DEL + 2 INS -> 1 - 4 / 4 = 0
- "ba" -> 1 DEL + 1 INS -> 1 - 2 / 4 = 0.5
- "ab" -> 1 DEL + 1 INS -> 1 - 2 / 4 = 0.5
- "ba" -> 1 DEL + 1 INS -> 1 - 2 / 4 = 0.5
- "a" -> 1 INS -> 1 - 1 / 3 ≈ 0.67
- "" -> 2 INS -> 1 - 2 / 2 = 0

So the algorithm should return ≈ 0.67 (= 2/3) for the substring "a".
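For reference, here is a minimal brute-force sketch of this metric (my own illustration, not one of the implementations discussed below). It enumerates every window of `len(a)` in `b` plus every shorter suffix of `b`, computes the indel distance with a standard dynamic program, and returns the best normalized similarity:

```python
def indel_distance(a, b):
    # edit distance with insertions and deletions only (no substitutions)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i] + [0] * len(b)
        for j, cb in enumerate(b, 1):
            if ca == cb:
                curr[j] = prev[j - 1]          # characters match, no cost
            else:
                curr[j] = 1 + min(prev[j],     # delete ca
                                  curr[j - 1]) # insert cb
        prev = curr
    return prev[-1]

def partial_similarity(a, b):
    # assumes len(a) <= len(b): candidates are every window of len(a) in b,
    # plus every shorter suffix of b (a substring "placed at the end")
    candidates = [b[i:i + len(a)] for i in range(len(b) - len(a) + 1)]
    candidates += [b[i:] for i in range(len(b) - len(a) + 1, len(b) + 1)]
    best = 0.0
    for sub in candidates:
        dist = indel_distance(a, sub)
        best = max(best, 1 - dist / (len(a) + len(sub)))
    return best

print(partial_similarity("aa", "bbaba"))  # 0.666... as derived above
```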
Different implementations
A similar ratio is implemented by the Python library FuzzyWuzzy in the form of `fuzz.partial_ratio`. It calculates the ratio in two steps (sketched below):
- search for matching subsequences in the longer string using `difflib.SequenceMatcher.get_matching_blocks`
- calculate the ratio for the substrings of `len(shorter_string)` starting at those matching subsequences and return the maximum ratio

This is really slow, so FuzzyWuzzy uses python-Levenshtein for the similarity calculation when it is available. It performs the same calculation based on the Levenshtein distance, which is faster. However, in edge cases the matching_blocks used for the ratio calculation are completely wrong (see issue 16), which makes it unsuitable as a replacement when correctness is relevant.
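A rough sketch of that two-step approach using `difflib` directly (a simplification of FuzzyWuzzy's logic, not its exact code; note that `SequenceMatcher.ratio()` uses Ratcliff/Obershelp matching, so it only approximates the indel-based norm defined above):

```python
from difflib import SequenceMatcher

def partial_ratio_sketch(shorter, longer):
    # step 1: find matching subsequences between the two strings
    blocks = SequenceMatcher(None, shorter, longer).get_matching_blocks()
    best = 0.0
    for block in blocks:
        # step 2: align a window of len(shorter) so it starts where the
        # matching block suggests, then score shorter against that window
        start = max(block.b - block.a, 0)
        window = longer[start : start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return best

print(partial_ratio_sketch("aa", "bbaba"))
```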
Current implementation
I currently use a C++ port of difflib in combination with a fast bit-parallel implementation of the Levenshtein distance with the weights insertion = 1, deletion = 1 and substitution = 2 (a sketch of the underlying idea follows the list). The current implementation can be found here:
- extracting matching_blocks: matching_blocks
- calculating weighted Levenshtein: weighted Levenshtein
- combining them to calculate the end ratio: partial_ratio
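Since substitutions cost 2, this weighted Levenshtein distance coincides with the pure indel distance, which can in turn be derived from the LCS length: `distance = len(a) + len(b) - 2 * LCS(a, b)`. A sketch of the bit-parallel idea in Python (an Allison-Dix-style bit-vector LCS; Python's arbitrary-precision integers stand in for the fixed-width machine words a C++ version would use):

```python
def lcs_length_bitparallel(a, b):
    # bitmask per character: bit i is set iff a[i] == c
    masks = {}
    for i, c in enumerate(a):
        masks[c] = masks.get(c, 0) | (1 << i)
    row = 0  # bit vector encoding the current column of the LCS DP
    for c in b:
        x = row | masks.get(c, 0)
        # bit-vector update: one full pass of the DP column per character of b
        row = x & ~(x - ((row << 1) | 1))
    return bin(row).count("1")

def indel_distance_fast(a, b):
    # a substitution at weight 2 never beats a deletion + insertion (1 + 1),
    # so the weighted Levenshtein distance equals the indel distance
    return len(a) + len(b) - 2 * lcs_length_bitparallel(a, b)

print(indel_distance_fast("aa", "ba"))  # 2, matching the table above
```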
Question
Is there a faster algorithm to calculate this kind of similarity? The requirements are:
- it only uses insertions/deletions (or gives substitutions the weight 2, which has the same effect)
- it allows a gap at the beginning of the longer string
- it allows a gap at the end of the longer string, as long as the remaining substring does not become shorter than the shorter string
- optimally it enforces that the substring has a similar length (when it is not at the end), so that it matches the behaviour of FuzzyWuzzy, but it would be fine if it allowed longer substrings to be matched as well: e.g. for `a = "aaa"` and `b = "aaba"` this would mean that it is allowed to use `aaba` as the optimal substring instead of `aab` (see the example below).
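For illustration, scoring both candidates with the metric defined above (reusing the `indel_distance` helper from the first sketch):

```python
a = "aaa"
for sub in ("aab", "aaba"):
    dist = indel_distance(a, sub)  # "aab" -> 2 (1 DEL + 1 INS), "aaba" -> 1 (1 DEL)
    print(sub, 1 - dist / (len(a) + len(sub)))
# "aab"  -> 1 - 2/6 ≈ 0.67
# "aaba" -> 1 - 1/7 ≈ 0.86, so the longer match would even score higher here
```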