
I know how the edit distance algorithm is implemented. Using dynamic programming, we first fill the first row and the first column of the table, and then fill each remaining entry by comparing three paths: from the left (insert), from above (delete), and from the upper left (copy/replace). In the Ratcliff/Obershelp algorithm, by contrast, we first extract the longest common substring of the two strings, and then recurse on the pair of substrings to its left and the pair to its right until no characters are left.
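To make the two descriptions concrete, here is a rough, unoptimized sketch of both algorithms (hand-written for this question, not taken from difflib or any edit-distance library):

```python
def edit_distance(a, b):
    """Levenshtein distance via the DP table described above."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):          # first column: delete i characters
        dp[i][0] = i
    for j in range(n + 1):          # first row: insert j characters
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # from above: delete
                           dp[i][j - 1] + 1,         # from the left: insert
                           dp[i - 1][j - 1] + cost)  # diagonal: copy/replace
    return dp[m][n]

def ro_matches(a, b):
    """Ratcliff/Obershelp matching-character count: find the longest common
    substring, then recurse on the left and right remainders."""
    best_len = best_i = best_j = 0
    for i in range(len(a)):          # brute-force longest-common-substring search
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > best_len:
                best_len, best_i, best_j = k, i, j
    if best_len == 0:
        return 0
    return (best_len
            + ro_matches(a[:best_i], b[:best_j])
            + ro_matches(a[best_i + best_len:], b[best_j + best_len:]))

print(edit_distance("kitten", "sitting"))  # 3
print(2 * ro_matches("tide", "diet") / (len("tide") + len("diet")))  # 0.25
```

The ratio printed at the end uses the formula 2*M / (len(a) + len(b)), the same shape of formula that difflib's .ratio() documents.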

Both can be used to compute the similarity between two strings and to transform one string into another using four operations: delete, replace, copy, and insert.
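For example, difflib's SequenceMatcher exposes exactly such an edit script through get_opcodes(), whose tags 'equal', 'replace', 'delete', and 'insert' correspond to those four operations ('equal' being the copy):

```python
import difflib

a, b = "kitten", "sitting"
sm = difflib.SequenceMatcher(None, a, b)
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    # tag is one of 'equal' (copy), 'replace', 'delete', 'insert'
    print(tag, a[i1:i2], "->", b[j1:j2])
# Tags produced for this pair: replace, equal, replace, equal, insert
```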

But I wonder when to use which: the SequenceMatcher based on edit distance, or the one in difflib?

Here is what I found on the internet that makes me think this question would also benefit others:

  1. The documentation of edit distance reads:

Similar to the difflib SequenceMatcher, but uses Levenshtein/edit distance.

  2. In this answer to a question on calculating edit distance, an answer based on the Ratcliff/Obershelp algorithm was provided.
  3. There are only a few resources about the Ratcliff/Obershelp algorithm, let alone comparisons with edit distance, which I had thought was the best-known string alignment algorithm.

So far, I have the following ideas:

  1. I find that edit distance and the Ratcliff/Obershelp algorithm can both be used for spell checking. But when should each be used?

  2. I thought edit distance is employed to find the minimal edit sequence, while the Ratcliff/Obershelp algorithm yields matches that "look right" to people. However, "look right" seems too vague a criterion, especially in real-world applications. Moreover, when is the minimal edit sequence required or preferred?

Any suggestions would be highly appreciated, and thanks in advance.

Lerner Zhang

1 Answer


"Looks right to people" needn't be all that vague. Search the web for discussion of why, e.g., the very widely used git source control system added "patience" and "histogram" differencing algorithms, as options. Variations of "minimal edit distance" routinely produce diffs that are jarring to humans, and I'm not going to reproduce examples here that are easily found by searching.

From a formal perspective, Levenshtein is more in line with what a mathematician means by "distance". Chiefly, difflib's .ratio() can depend on the order of the arguments passed to it, but Levenshtein is insensitive to order:

>>> import difflib
>>> difflib.SequenceMatcher(None, "tide", "diet").ratio()
0.25
>>> difflib.SequenceMatcher(None, "diet", "tide").ratio()
0.5
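By contrast, any correct Levenshtein implementation gives the same distance in both directions; here is a standard dynamic-programming sketch, written for illustration, that shows the symmetry:

```python
def levenshtein(a, b):
    # Row-by-row DP over prefixes of a and b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete
                           cur[j - 1] + 1,             # insert
                           prev[j - 1] + (ca != cb)))  # copy/replace
        prev = cur
    return prev[-1]

print(levenshtein("tide", "diet"), levenshtein("diet", "tide"))  # 3 3
```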

For the rest, I don't think you're going to get crisp answers. There are many notions of "similarity", not just the two you mentioned, and they all have their fans. "Minimal" was probably thought to be more important back when disk space and bandwidth were scarce and expensive.

The physical realities constraining genetic mutation have made measures that take into account spatial locality much more important in that field - doesn't matter one whit if it's "minimal" if it's also physically implausible ;-) Terms to search for: Smith–Waterman, and Needleman–Wunsch.

Tim Peters
    The symmetry property and the disk space and bandwidth limitations make edit distance preferable. Thanks. I am reading the 7th chapter of Maxime Crochemore's Algorithms On Strings and will come back to this question after reading it. – Lerner Zhang Oct 10 '21 at 04:07