1

I'm trying to solve the edit distance problem. the code I've been using is below.

 public static int minDistance(String word1, String word2) {
    int len1 = word1.length();
    int len2 = word2.length();

    // len1+1, len2+1, because finally return dp[len1][len2]
    int[][] dp = new int[len1 + 1][len2 + 1];

    for (int i = 0; i <= len1; i++) {
        dp[i][0] = i;
    }

    for (int j = 0; j <= len2; j++) {
        dp[0][j] = j;
    }

    //iterate though, and check last char
    for (int i = 0; i < len1; i++) {
        char c1 = word1.charAt(i);
        for (int j = 0; j < len2; j++) {
            char c2 = word2.charAt(j);

            //if last two chars equal
            if (c1 == c2) {
                //update dp value for +1 length
                dp[i + 1][j + 1] = dp[i][j];
            } else {
                int replace = dp[i][j] + 1 ;
                int insert = dp[i][j + 1] + 1  ;
                int delete = dp[i + 1][j] + 1 ;


                int min = replace > insert ? insert : replace;
                min = delete > min ? min : delete;
                dp[i + 1][j + 1] = min;
            }
        }
    }

    return dp[len1][len2];
}

It's a DP approach. The problem it since it use a 2D array we cant solve this problem using above method for large strings. Ex: String length > 100000.

So Is there anyway to modify this algorithm to overcome that difficulty ?

NOTE: The above code will accurately solve the Edit Distance problem for small strings. (which has length below 1000 or near)

As you can see in the code it uses a Java 2D Array "dp[][]" . So we can't initialize a 2D array for large rows and columns.

Ex : If i need to check 2 strings whose lengths are more than 100000

int[][] dp = new int[len1 + 1][len2 + 1];

the above will be

int[][] dp = new int[100000][100000];

So it will give a stackOverflow error.

So the above program only good for small length Strings. What I'm asking is , Is there any way to solve this problem for large strings(length > 100000) efficiently in java.

prime
  • 14,464
  • 14
  • 99
  • 131
  • Why is the input so long? Maybe knowing a bit more about the circumstances would allow us to suggest better options. – Aaron Digulla Oct 06 '14 at 10:22
  • It's when we want to compare 2 strings which has lengths more than 100000. In that situation we can't create java 2D arrays. – prime Oct 06 '14 at 10:23
  • @jurgemaister : I added some more details. And this is not a homework :) – prime Oct 06 '14 at 10:29
  • Wiki page you linked references Ukkonen's optimization which should be enough. And you can go to http://stackoverflow.com/questions/4057513/levenshtein-distance-algorithm-better-than-onm for more ideas. – Deltharis Oct 06 '14 at 10:42

1 Answers1

2

First of all, there's no problem in allocating a 100k x 100k int array in Java, you just have to do it in the Heap, not the Stack (and on a machine with around 80GB of memory :))

Secondly, as a (very direct) hint:

Note that in your loop, you are only ever using 2 rows at a time - row i and row i+1. In fact, you calculate row i+1 from row i. Once you get i+1 you don't need to store row i anymore.

This neat trick allows you to store only 2 rows at the same time, bringing down the space complexity from n^2 to n. Since you stated that this is not homework (even though you're a CS undergrad by your profile...), I'll trust you to come up with the code yourself.

Come to think of it I recall having this exact problem when I was doing a class in my CS degree...

Ordous
  • 3,844
  • 15
  • 25
  • I got your Idea. Thanks for the answer. Btw I'm practicing DP for a programming competition. I found this problem almost in all the references. But when I try to compare large strings it fails. coz the above reason. (we cant allocate memory more than 256MB in normal competitions) so I post the question here in SO to get a hint. Thanks again. I'll check this. – prime Oct 06 '14 at 10:57
  • @prime Then ignore the not-so-subtle remarks. This should bring down memory reqs to far below the maximum. – Ordous Oct 06 '14 at 11:09