15

Given two strings s and t, I need to find the edit distance (Levenshtein distance) from each substring of s to t. More precisely, for each position i in s, I need the minimum edit distance to t over all substrings of s that start at position i.

For example:

t = "ab"    
s = "sdabcb"

And I need to get something like:

{2,1,0,2,2}

Explanation:

1st position:
distance("ab", "sd") = 2 ( 2*subst )
distance("ab", "sda") = 3( 2*delete + insert )
distance("ab", "sdab") = 2 ( 2 * delete)
distance("ab", "sdabc") = 3 ( 3 * delete)
distance("ab", "sdabcb") = 4 ( 4 * delete)
So, minimum is 2

2nd position:
distance("ab", "da") = 2 (delete + insert)
distance("ab", "dab") = 1 (delete)
distance("ab", "dabc") = 2 (2*delete)
....
So, minimum is 1

3rd position:
distance("ab", "ab") = 0
...
minimum is 0

and so on.

I could, of course, solve this by brute force. But is there a faster algorithm?

Thanks for help.

Ivan Bianko
  • 1
    I know that your answer `{2,1,**0,2**,2}` is wrong, because adjacent numbers can differ by at most 1: if there is a substring `s[i..j]` with minimum edit distance `k` to `t`, then the substring `s[(i+1)..j]` can match `t` with cost at most `k+1` by making the first edit operation an insertion of `s[i]` at the very start of the string. In your example, for the 4th position, `distance("ab", "b") = 1` (1 insert) and for the 5th, `distance("ab", "cb") = 1` (1 subst). – j_random_hacker Nov 16 '11 at 03:46
  • @Anderson Green Just to clarify, are you still looking (as in the original question) only for the minimum edit distance from each position in `s`, or something more? – kcsquared Apr 22 '22 at 23:26
  • @kcsquared, Yes, I want to find the substring with the minimum edit distance. – Anderson Green Apr 23 '22 at 14:09

2 Answers

19

Finding the closest-matching substring in a given string is easy: take the normal Levenshtein algorithm and modify it slightly.

FIRST: Instead of filling the first row of the matrix with 0,1,2,3,4,5,... you fill it entirely with zeros. (green rectangle)

SECOND: Then you run the algorithm.

THIRD: Instead of returning the last cell of the last row you search for the smallest value in the last row and return it. (red rectangle)

Example: needle: "aba", haystack: "c abba c" --> result = 1 (converting abba -> aba)

[Image: Levenshtein matrix with the zero-filled first row outlined in green and the last row outlined in red]

I tested it and it works.

This is much faster than stepping through the string character by character, as you do in your question, because you build the matrix only once.
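The three steps above could be sketched as follows. This is a minimal illustration in Python (my choice of language; the answer does not name one), and the function names are hypothetical. It keeps only two rows of the matrix at a time rather than the full matrix. The last helper is an assumption about how to bridge this answer to the literal question: the last row is indexed by where a substring *ends*, so running the same routine on both strings reversed yields the minimum for each *start* position.

```python
def substring_end_distances(needle: str, haystack: str) -> list[int]:
    """Return the last row of the modified Levenshtein matrix.

    Entry j is the minimum edit distance between `needle` and any
    substring of `haystack` that ends just before index j
    (the empty substring is allowed).
    """
    # FIRST: the row for the empty needle prefix is all zeros instead of
    # 0, 1, 2, ..., so a match may start anywhere in the haystack for free.
    prev = [0] * (len(haystack) + 1)
    # SECOND: run the usual recurrence, keeping only two rows in memory.
    for i, nc in enumerate(needle, start=1):
        curr = [i] + [0] * len(haystack)  # i needle chars vs. empty substring
        for j, hc in enumerate(haystack, start=1):
            curr[j] = min(
                prev[j] + 1,               # needle char with no counterpart
                curr[j - 1] + 1,           # haystack char with no counterpart
                prev[j - 1] + (nc != hc),  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev


def min_substring_distance(needle: str, haystack: str) -> int:
    # THIRD: return the smallest value in the last row.
    return min(substring_end_distances(needle, haystack))


def min_distance_from_each_start(t: str, s: str) -> list[int]:
    """For each start index i in s, the minimum distance from t to any
    substring of s beginning at i (assumes t is non-empty)."""
    by_end = substring_end_distances(t[::-1], s[::-1])
    # A substring of reversed s ending before index k is a substring of s
    # starting at index len(s) - k.
    return [by_end[len(s) - i] for i in range(len(s))]


print(min_substring_distance("aba", "c abba c"))
print(min_distance_from_each_start("ab", "sdabcb"))
```

For the example in this answer the first call gives 1. For the question's strings the second call gives one value per position of `s`, and neighbouring values differ by at most 1. The whole computation fills a single |t| × |s| table, so it takes O(|s|·|t|) time.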

Elmue
  • I don't exactly understand how this modified algorithm was implemented: does an implementation of the algorithm exist? (There are [several different algorithms](https://en.wikipedia.org/wiki/Levenshtein_distance#Computation) to compute the Levenshtein distance, so I don't know which algorithm it was based on.) – Anderson Green Apr 21 '22 at 18:06
  • @AndersonGreen change `d[0, j] := j` to `d[0, j] := 0`, for example. But does this answer give you what you need? – David Eisenstat Apr 21 '22 at 20:54
  • @DavidEisenstat It looks like it was based on [this algorithm](https://en.wikipedia.org/wiki/Levenshtein_distance#Iterative_with_full_matrix), but I hope it's possible to improve it further. There is a [more efficient algorithm](https://en.wikipedia.org/wiki/Levenshtein_distance#Iterative_with_two_matrix_rows) that computes the Levenshtein distance using only two matrix rows. – Anderson Green Apr 22 '22 at 13:55
  • @AndersonGreen change `v0[i] = i` to `v0[i] = 0` for that one. – David Eisenstat Apr 22 '22 at 19:23
  • 1
    I also found [an implementation of this substring-matching algorithm](https://github.com/agnivade/levenshtein/issues/12#issue-510847466) in Go. – Anderson Green Apr 25 '22 at 18:48
  • 2
    @AndersonGreen the Go implementation solves your purpose then? Or is it missing something? – Abhinav Mathur Apr 26 '22 at 02:43
  • @AbhinavMathur Yes, this is the algorithm that I wanted to find. – Anderson Green Apr 27 '22 at 13:40
  • @AndersonGreen so I'm assuming there's no point of adding an answer? Or there's a particular language for which you've opened the bounty? – Abhinav Mathur Apr 27 '22 at 13:42
5

The Wagner-Fischer algorithm gives you the answer for all prefixes "for free".

http://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm

The last row of the Wagner-Fischer matrix contains the edit distance from each prefix of s to t.

So as a first crack at your problem, for each i, run Wagner-Fischer on t and the suffix of s starting at i (with t along the rows), and select the smallest element in the last row.
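A rough sketch of that per-position approach, in Python (the language and the names `last_row` and `brute_force` are my own assumptions, not part of this answer). It places `t` along the rows and a suffix of `s` along the columns, so that the last row holds the distance from `t` to each prefix of that suffix.

```python
def last_row(t: str, u: str) -> list[int]:
    """Wagner-Fischer with t along the rows and u along the columns.

    Entry j of the result is the edit distance between t and u[:j].
    """
    prev = list(range(len(u) + 1))  # empty t vs. each prefix of u
    for i, tc in enumerate(t, start=1):
        curr = [i] + [0] * len(u)   # t[:i] vs. the empty prefix of u
        for j, uc in enumerate(u, start=1):
            curr[j] = min(
                prev[j] + 1,               # drop a character of t
                curr[j - 1] + 1,           # drop a character of u
                prev[j - 1] + (tc != uc),  # match or substitution
            )
        prev = curr
    return prev


def brute_force(t: str, s: str) -> list[int]:
    """One Wagner-Fischer run per start position i of s."""
    # [1:] skips the entry for the empty prefix, so only non-empty
    # substrings s[i:j] are considered.
    return [min(last_row(t, s[i:])[1:]) for i in range(len(s))]


print(brute_force("ab", "sdabcb"))
```

Each run costs O(|s|·|t|), so doing one run per start position costs O(|s|²·|t|) overall.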

I will be curious to see if anyone else knows (or can find) a better approach.

Nemo
  • Thanks, but this is the solution I meant by brute force... I hope a better solution exists (in terms of time complexity). – Ivan Bianko Nov 15 '11 at 18:17
  • I doubt that anybody will understand your answer without an example. – Elmue Jun 14 '16 at 06:36
  • if you're referring to `s` and `t` mentioned in the wiki, the last row contains edit distance from `s` to each prefix of `t`, not a distance from each prefix of `s` to `t` – mangusta Feb 12 '19 at 10:20