3

If I have some distance which I do not want to exceed. Example = 2. Do I can break from algoritm before its complete completion knowing the minimum allowable distance?

Perhaps there are similar algorithms in which it can be done.

It is necessary for me to reduce the time of work programs.

  • 1
    Not sure if I understand correctly. Levenshtein distance has a runtime complexity of O(n*m) where n and m are the length of the two words. Are you asking for a lower bound such that you can avoid performing the distance computation? One simple lower bound is LD(a, b) >= |len(a)-len(b)| having only a O(1) complexity. – SaiBot Feb 21 '18 at 08:51
  • @SaiBot The question is whether it is possible to exit the algorithm before it is fully executed knowing the minimum permissible distance –  Feb 21 '18 at 09:06
  • Yes you can do early stopping if every cost in row or column is already larger than 2. Additionally, you can use the lower bound I proposed. However, if you want to find all words that have a lower distance than 2 to a query in a large set of words there are probably better ways of doing this, e.g., through precomputation. – SaiBot Feb 21 '18 at 09:21

2 Answers2

5

Yes you can and it does reduce the complexity.

The main thing to observe is that levenstein_distance(a,b) >= |len(a) - len(b)| That is the distance can't be less than the difference in the lengths of the strings. At the very minimum you need to add characters to make them the same length.

Knowing this you can ignore all the cells in the original matrix where |i-j| > max_distance. So you can modify your loops from

for (i in 0 -> len(a))
   for (j in 0 -> len(b))

to

for (i in 0-> len(a))
   for (j in max(0,i-max_distance) -> min(len(b), i+max_distance)) 

You can keep the original matrix if it's easier for you, but you can also save space by having a matrix of (len(a), 2*max_distance) and adjusting the indices.

Once every cost you have in the last row > max_distance you can stop the algorithm.

This will give you O(N*max_distance) complexity. Since your max_distance is 2 the complexity is almost linear. You can also bail at the start is |len(a)-len(b)| > max_distance.

yassin
  • 642
  • 2
  • 7
  • 18
Sorin
  • 11,863
  • 22
  • 26
  • I have max distance value = 4 and two strings - "#kjhEISV" and "#kjhDRSGBG". Final distance = 5 but I not break algorithm use "last row > max_distance you can stop the algorithm." –  Feb 28 '18 at 15:39
  • @noobprogrammer You interpret it wrong. You said that your max distance value = 4. So you don't care if the distance is 5 or more. The algorithm will break and tell you the distance is more than 4. The algorithm will return the correct value only if it's within max distance. You can make a nicer wrapper to return an error code or throw an exception or whatever it's better in your context. – Sorin Mar 06 '18 at 13:55
  • @noobprogrammer The fact that you get a 5 as a result you need to interpret as the distance between the strings is more than max distance (4). – Sorin Mar 06 '18 at 13:56
1

If you do top-down dynamic programming/recursion + memoization, you could pass the current size as an extra parameter and return early if it exceeds 2. But I think this will be inefficient because you will revisit states.

If you do bottom-up dp, you will fill row by row (you only have to keep the last and current row). If the last row only has entries greater than 2, you can terminate early.

Modify your source code according to my comment:

for (var i = 1; i <= source1Length; i++)
{
                for (var j = 1; j <= source2Length; j++)
                {
                    var cost = (source2[j - 1] == source1[i - 1]) ? 0 : 1;

                    matrix[i, j] = Math.Min(
                        Math.Min(matrix[i - 1, j] + 1, matrix[i, j - 1] + 1),
                        matrix[i - 1, j - 1] + cost);
                }
                // modify here:
                // check here if matrix[i,...] is completely > 2, if yes, break

}
yassin
  • 642
  • 2
  • 7
  • 18
  • I use this algorithm - https://gist.github.com/Davidblkx/e12ab0bb2aff7fd8072632b396538560. Last row i will know after fully executed this method –  Feb 21 '18 at 09:12
  • @noobprogrammer ok, you keep all rows, but the last row I reference is the one you last update. I added a part of your src code, hope it helps – yassin Feb 21 '18 at 09:24
  • You shouldn't iterate through the entire second string, only through a narrow window around i `(i-max_distance, i+max_distance)` you will only find larger values outside it. As it stands you still have O(N*M) complexity. – Sorin Feb 21 '18 at 09:51
  • @Sorin yes, it doesn't change the complexity. your idea is great – yassin Feb 21 '18 at 09:59