0

So recently I embarked on a coding project to try create some code to mathematically create a way to depict how similar two strings are. On my research I found plenty of examples online to help me create the code I desired. I have an error with one I found which in my running of it is creating, what I am calling, exponential loops. It's not an infinite loop, it runs and it solves problems, but the longer the strings I pass into the method the exponentially longer the method runs for. The code is here below

public static int LevenshteinDistance(this string source, string target)
    {
        Console.WriteLine("Start of method");
        if (source.Length == 0) { return target.Length; }
        if (target.Length == 0) { return source.Length; }

        int distance = 0;

        Console.WriteLine("Distance creation");
        if (source[source.Length - 1] == target[target.Length - 1]) { distance = 0; }
        else { distance = 1; }

        Console.WriteLine("Recursive MethodCall");
        return Math.Min(Math.Min(LevenshteinDistance(source.Substring(0, source.Length - 1), target) + 1,
                                 LevenshteinDistance(source, target.Substring(0, target.Length - 1))) + 1,
                                 LevenshteinDistance(source.Substring(0, source.Length - 1), target.Substring(0, target.Length - 1)) + distance);
    }

So, for smaller strings this method runs just fine however, when I start to pass in things like addresses or long names it takes a long time. So long in fact I disbanded this method entirely and wrote another one (i'll provide this if anybody wants it but it's not important for the question) which serves my purpose but in the interest of solving problems and learning, I tried to figure out exactly why this one was taking so long when coded recursively. I stepped through my code with pen and paper, in debug mode and when I get to the recursive call here

return Math.Min(Math.Min(LevenshteinDistance(source.Substring(0, source.Length - 1), target) + 1,
                                 LevenshteinDistance(source, target.Substring(0, target.Length - 1))) + 1,
                                 LevenshteinDistance(source.Substring(0, source.Length - 1), target.Substring(0, target.Length - 1)) + distance);
    }

And I can work out what is happening solidly for these parts

return Math.Min(***Math.Min(LevenshteinDistance(source.Substring(0, source.Length - 1), target) + 1,
                                 LevenshteinDistance(source, target.Substring(0, target.Length - 1))) + 1,***
                                 LevenshteinDistance(source.Substring(0, source.Length - 1), target.Substring(0, target.Length - 1)) + distance);
    }

I tried to bolden and italicize the part I mean, it's the part with '***' around it. Getting to that part I understand but then the next part, the third LevenshteinDistance Call and it's first iteration I start to lose focus on recursion and how it works and where my confusion starts. This part

LevenshteinDistance(source.Substring(0, source.Length - 1), target.Substring(0, target.Length - 1)) + distance);
    }

The code from what I gather once it reaches this call then starts to do the first LevenshteinDistance call all over again and reiterating over the calls it just performed. This is what I'm being confused on. Why does it call the first part of the recursive call again and then the second and what is causing the time to complete the code to lengthen exponentially?

Note: The time may not be exponentially in the literal sense, the times for running a short string comparison were sub-second and once the strings got a little longer it jumper from sub-second -> ~15 seconds -> 2:30 mins -> 35 minutes

2nd Note: Tagged as infinite loop as exponential loop doesn't exist and this somewhat is close to it.

Dexter Whelan
  • 414
  • 3
  • 15

4 Answers4

4

Because it's recursive, not just single recursion (like you would use for a treeview), this puppy passes the return result of itself 3 recursive calls!

That's why your seeing exponential timing increases with longer strings.

Jeremy Thompson
  • 61,933
  • 36
  • 195
  • 321
  • 2
    the "error" isn't recursion itself, some of the most efficient code is recursive method but is the type of recursion you have created, it's recurse more time more longer is the input string. – Othello .net dev Apr 06 '16 at 14:23
  • Eventually you'll hit a [SO] ;) – Jeremy Thompson Apr 06 '16 at 14:24
  • thank you Jeremy, where can I find a place to learn how to draw a 'treeview' of a program? Maybe i might try employ this tactic next time i encounter a similar problem – Dexter Whelan Apr 12 '16 at 08:55
  • You just use the call stack, Ctrl + Alt + C, and you can halt the program using conditional break points to see the call stacks depth at given times during the operation. Generally if the call stack is > 150 calls to the same (recursive) function then you need to check that out. – Jeremy Thompson Apr 12 '16 at 08:58
1

For a pair of strings (source, target) of size n and m you are doing 3 recursive calls to the function.

LevenshteinDistance(source[0..n - 1], target)
LevenshteinDistance(source, target[0..m - 1])
LevenshteinDistance(source[0..n - 1], target[0..m - 1])

So you are creating a tree with 3 children for each node and with a min depth of min(n,m) and max depth max(m,n)

So, in each level of this tree, there are 3 times the amount of nodes than the previous level:

0
|- 1
    |- 2
    |- 2
    |- 2
|- 1
    |- 2
    |- 2
    |- 2
|- 1
    |- 2
    |- 2
    |- 2

And so on.

So, for the each level k in your tree, you have 3k nodes. So the complexity of your algorithm is O(3max(n,m)) which is exponential.

Edgar Hernandez
  • 4,020
  • 1
  • 24
  • 27
1

The standard recursive algorithm computes values multiple times.

Here is a small example of two strings of size 3, the sequence of calculation is

D[2, 2] = min(recurse(1, 2), recurse(2, 1), recurse(1, 1) + eq(1, 1))

following the 3 recursion calls you get

//first recursive call
D[1, 2] = min(recurse(0, 2), recurse(1, 1), recurse(0, 1))

//second recursive call
D[2, 1] = min(recurse(1, 1), recurse(2, 0), recurse(1, 0))

//third recursive call
D[1, 1] = min(recurse(0, 1), recurse(1, 0), recurse(0, 0))

Already here you see that we have multiple calculations of the same value.

The complexity gets as you already figured out, exponential. More precise Θ(3^min(m, n)). Here is a good answer that explains and calculates the complexity.

However, this can be overcome by using a cache for the computed values and check the cache if the value has already been computed. This method is also called Memoization and the complexity then becomes Θ(nm).

Community
  • 1
  • 1
gustf
  • 1,959
  • 13
  • 20
1

Notice that you are making 3 recursive calls for each call. My math is slightly off, but roughly you are making 3 calls for every level (in the recursive call tree). The level corresponds to the min number of characters between the 2 input string.

For a LevenshteinDistance("a", "x") call you will end up making 4 calls (first call + 3 recursive calls)

For a LevenshteinDistance("ab", "xy") call you will end up making 19 calls (first call + 3 recursive calls with each recursive call resulting in 3 more calls, 2 of those calls will recurse again for the last time)

(ab, xy)
        (a, xy)
                (<empty>, xy)
                (a, x)
                        (<empty>, x)
                        (a, <empty>)
                        (<empty>, <empty>)
                (<empty>, x)
        (ab, x)
                (a, x)
                        (<empty>, x)
                        (a, <empty>)
                        (<empty>, <empty>)
                (ab, <empty>)
                (a, <empty>)
        (a, x)
                (<empty>, x)
                (a, <empty>)
                (<empty>, <empty>)

Note that every decent (processing of last character from the string) in the call tree, does not reduce n by 1, causing the total to be between (3^(n+1)-1)/2 to (3^(n+2)-1)/2 calls

Hope that sheds enough light on the code's performance

I haven't analyzed your algorithm or implementation too much, but off top of my head, I can tell you somethings to improve the performance

  1. Use character array instead of strings in this case as parameters. And pass pointer to the last character of the array that is under consideration. This will not only reduce a lot of unwanted allocations, but remove all substring calls
  2. Use dynamic programming i.e. store the results of your call and look that up before firing off the recursive calls, since the same recursive calls are made a lot of times
Vikhram
  • 4,294
  • 1
  • 20
  • 32