Expected behaviour of the algorithm

I have two strings a and b, with a being the shorter string. I would like to find the substring of b that has the highest similarity to a. The substring has to be of length len(a), or it has to be placed at the end of b, e.g. for the following two strings:

a = "aa"
b = "bbaba"

the possible substrings of b would be

"bb"
"ba"
"ab"
"ba"
"a"
""

The edit distance is defined as the number of insertions and deletions. Substitutions are not possible (an insertion + a deletion has to be used instead). The similarity between the two strings is calculated according to the following equation: norm = 1 - distance / (len(a) + len(substring)). So the substrings above would give the following results:

"bb" -> 2 DEL + 2 INS -> 1 - 4 / 4 = 0
"ba" -> 1 DEL + 1 INS -> 1 - 2 / 4 = 0.5
"ab" -> 1 DEL + 1 INS -> 1 - 2 / 4 = 0.5
"ba" -> 1 DEL + 1 INS -> 1 - 2 / 4 = 0.5
"a"  ->         1 INS -> 1 - 1 / 3 = 0.66
""   ->         2 INS -> 1 - 2 / 2 = 0

So the algorithm should return approximately 0.67.
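
A minimal brute-force reference sketch of this behaviour in Python (only to make the expected result concrete, not the fast solution I am asking about; the function names are just for illustration). It uses the fact that the indel-only edit distance equals len(a) + len(s) - 2 * LCS(a, s):

def lcs_length(a, s):
    # classic dynamic-programming longest common subsequence
    prev = [0] * (len(s) + 1)
    for ch in a:
        cur = [0]
        for j, cs in enumerate(s, 1):
            cur.append(prev[j - 1] + 1 if ch == cs else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def best_partial_similarity(a, b):
    # substrings of length len(a), plus the shorter suffixes at the end of b
    candidates = [b[i:i + len(a)] for i in range(len(b) - len(a) + 1)]
    candidates += [b[i:] for i in range(len(b) - len(a) + 1, len(b) + 1)]
    best = 0.0
    for s in candidates:
        dist = len(a) + len(s) - 2 * lcs_length(a, s)
        total = len(a) + len(s)
        best = max(best, 1 - dist / total if total else 1.0)
    return best

print(best_partial_similarity("aa", "bbaba"))  # ≈ 0.67, the suffix "a" wins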

Different implementations

A similar ratio is implemented by the Python library FuzzyWuzzy in the form of fuzz.partial_ratio. It calculates the ratio in two steps:

  1. searches for matching subsequences in the longer sequence using difflib.SequenceMatcher.get_matching_blocks

  2. calculates the ratio for substrings of length len(shorter_string) starting at the matching subsequences and returns the maximum ratio

This is really slow, so FuzzyWuzzy uses python-Levenshtein for this similarity calculation when it is available. That library performs the same calculation based on the Levenshtein distance, which is faster. However, in edge cases the calculated matching_blocks used for the ratio calculation are completely wrong (see issue 16), which makes it unsuitable as a replacement when correctness is relevant.
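
As a rough sketch of that two-step procedure (this only mirrors the description above, it is not the actual FuzzyWuzzy code):

from difflib import SequenceMatcher

def partial_ratio_sketch(shorter, longer):
    blocks = SequenceMatcher(None, shorter, longer).get_matching_blocks()
    best = 0.0
    for block in blocks:
        # align a window of len(shorter) characters in the longer string
        # with the start of the matching block
        start = max(block.b - block.a, 0)
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return best

print(partial_ratio_sketch("aa", "bbaba"))  # 0.5 - only windows of len(shorter) are tried, never the short suffix "a"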

Current implementation

I currently use a C++ port of difflib in combination with a fast bit-parallel implementation of the Levenshtein distance with the weights insertion=1, deletion=1 and substitution=2. The current implementation can be found here:
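
For reference, a plain dynamic-programming version of that weighted distance (not the bit-parallel implementation I actually use) looks roughly like this:

def weighted_levenshtein(a, s):
    # Levenshtein distance with insertion = 1, deletion = 1, substitution = 2,
    # which yields the same value as the indel-only edit distance
    prev = list(range(len(s) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cs in enumerate(s, 1):
            cur.append(min(prev[j] + 1,                            # deletion
                           cur[-1] + 1,                            # insertion
                           prev[j - 1] + (0 if ca == cs else 2)))  # substitution
        prev = cur
    return prev[-1]

print(weighted_levenshtein("aa", "ba"))  # 2 -> 1 - 2 / 4 = 0.5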

Question

Is there a faster algorithm to calculate this kind of similarity? The requirements are:

  • only uses insertions/deletions (or gives substitutions the weight 2, which has a similar effect)
  • allows a gap at the beginning of the longer string
  • allows a gap at the end of the longer string, as long as the remaining substring does not become shorter than the shorter string.
  • ideally it enforces that the substring has a similar length (when it is not at the end), so it matches the behaviour of FuzzyWuzzy, but it would be fine if longer substrings were allowed to be matched as well: e.g. for aaba:aaa this would mean that it is allowed to use aaba as the optimal substring instead of aab.