
I'm completely brain-fried after this. I need to find the longest common substring between 2 files, a small one and a HUGE one. I don't even know where to start; here's what I have so far:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class MyString
{
    public static void main (String[] args) throws IOException
    {
        BufferedReader br = new BufferedReader(new FileReader("MobyDick.txt"));
        BufferedReader br2 = new BufferedReader(new FileReader("WarAndPeace.txt"));
        String md, wp;
        StringBuilder s = new StringBuilder();
        // read the first file into one big string, joining lines with spaces
        while ((md = br.readLine()) != null)
        {
            s.append(md).append(" ");
        }
        md = s.toString();
        s.setLength(0);
        // same for the second file
        while ((wp = br2.readLine()) != null)
        {
            s.append(wp).append(" ");
        }
        wp = s.toString();
        br.close();
        br2.close();

        md = md.replaceAll("\\s+", " "); // collapse runs of whitespace into single spaces
        wp = wp.replaceAll("\\s+", " ");
    }
}

What I did so far was to read each file into a StringBuilder and then into a String so I could get rid of double spaces (they came up a lot in MobyDick.txt). I found this code:

public static String longestSubstring(String str1, String str2) {

    StringBuilder sb = new StringBuilder();
    if (str1 == null || str1.isEmpty() || str2 == null || str2.isEmpty())
        return "";

    // ignore case
    str1 = str1.toLowerCase();
    str2 = str2.toLowerCase();

    // java initializes them already with 0
    int[][] num = new int[str1.length()][str2.length()];
    int maxlen = 0;
    int lastSubsBegin = 0;

    for (int i = 0; i < str1.length(); i++) {
        for (int j = 0; j < str2.length(); j++) {
            if (str1.charAt(i) == str2.charAt(j)) {
                if ((i == 0) || (j == 0))
                    num[i][j] = 1;
                else
                    num[i][j] = 1 + num[i - 1][j - 1];

                if (num[i][j] > maxlen) {
                    maxlen = num[i][j];
                    // generate substring from str1 => i
                    int thisSubsBegin = i - num[i][j] + 1;
                    if (lastSubsBegin == thisSubsBegin) {
                        // the current LCS is the same as the last time this block ran
                        sb.append(str1.charAt(i));
                    } else {
                        // reset the string builder if a different LCS is found
                        lastSubsBegin = thisSubsBegin;
                        sb = new StringBuilder();
                        sb.append(str1.substring(lastSubsBegin, i + 1));
                    }
                }
            }
        }
    }

    return sb.toString();
}

This code helps, but only on small files; every time I run it with the big files I get an "out of memory: java heap space" error. I need the right algorithm to get around the heap space issue, and no, I can't increase Java's memory. Can anyone help or point me in the right direction?

  • Maybe you should look at the way you're approaching the problem. One thing that comes to my mind to save memory is that you should only deal with blocks the size of the smallest file (since that is the largest potential match). After comparing the small file and the first block of the large file, iterate through it. Not the most efficient method, but maybe it will get you started in the right direction. – user3507600 May 19 '14 at 20:15
  • I did try to do that actually. On this line, int[][] num = new int[str1.length()][str2.length()];, the maximum size it can be is 1000 (trial and error), so I figured I could break up the text into 1000 sections for every 1 million characters, but WarAndPeace is around 3 million+ characters, and that would need 4000+ sections of the text, which is just as bad because I also need to worry about how long it takes as well. – Steven R May 19 '14 at 20:28
  • You may want to check out [this thread](http://stackoverflow.com/questions/7107517/how-to-compare-large-text-files) as it is also doing comparisons of large files. I would suggest first worrying about getting it to run through before worrying about speed (not always the best rule to live by, but getting something working can often lead to insights on how to do a faster method). I believe, though, you'd be better off breaking both texts into smaller parts and comparing those smaller parts. The key is to make sure you account for matches that extend beyond the size of the blocks. – user3507600 May 19 '14 at 20:34

1 Answer


First you need to identify exactly why this is such a memory hog, and then you can start to work around it.

This declaration jumps out as a potential problem:

int[][] num = new int[str1.length()][str2.length()];

War and Peace is over 3 million characters long, and Moby Dick is about half that length, so we will conservatively call it a million characters.

You are trying to allocate space for 3,000,000,000,000 integers, each of which is 4 bytes, which works out to be 12,000,000,000,000 bytes or a bit under 11 TB.

Hopefully it's clear why the algorithm is not suited for strings of this length.
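
As an aside: if all you need is to get past the heap error without changing algorithms, notice that the table above only ever reads the previous row. Here is a minimal sketch of that standard rolling-row trick (my illustration, not code from the question; case-folding from the original snippet is omitted for brevity):

public static String longestSubstringLowMem(String str1, String str2) {
    if (str1 == null || str1.isEmpty() || str2 == null || str2.isEmpty())
        return "";
    // Two rows instead of the full n-by-m table: curr[j] plays the role of
    // num[i][j] above, and prev[j] the role of num[i-1][j].
    int[] prev = new int[str2.length() + 1];
    int[] curr = new int[str2.length() + 1];
    int maxLen = 0;
    int endIndex = 0; // exclusive end of the best match within str1

    for (int i = 1; i <= str1.length(); i++) {
        for (int j = 1; j <= str2.length(); j++) {
            if (str1.charAt(i - 1) == str2.charAt(j - 1)) {
                curr[j] = prev[j - 1] + 1; // extend the diagonal run
                if (curr[j] > maxLen) {
                    maxLen = curr[j];
                    endIndex = i;
                }
            } else {
                curr[j] = 0; // rows are reused, so stale counts must be cleared
            }
        }
        int[] tmp = prev; prev = curr; curr = tmp; // slide the window down one row
    }
    return str1.substring(endIndex - maxLen, endIndex);
}

That cuts the memory to two int arrays of str2's length (a few megabytes here), but the running time is still proportional to n times m, roughly 3 trillion character comparisons for these two books, so a fundamentally better algorithm is still worth pursuing.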

Thankfully, one of the fundamental principles of computer science is that you can often trade time for memory and vice versa.

Instead you want to try a generalized suffix tree. This has a memory cost of Θ(n + m) and can be constructed in Θ(n + m) time, which is much more manageable.

Here is an excellent guide to an O(n) algorithm to generate such trees.

Once you have the suffix tree in place, finding the LCS can be done in time linear in the size of the tree, by finding the deepest node whose subtree contains suffixes of both input strings. A typical strategy is to mark each node v with a flag i if it satisfies the property:

The subtree rooted at v contains a suffix of the string S_i

and then find the deepest node v that is marked with i for all i in the range (in this case, just 0 and 1).
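
To make that concrete, here is a rough sketch of that marking pass, assuming the generalized suffix tree is already built. The Node layout below is hypothetical; a real Ukkonen-style implementation stores edge labels as index ranges into the text and tracks string depths rather than concatenating path strings, which is what keeps the pass linear:

import java.util.HashMap;
import java.util.Map;

public class SuffixTreeLcs {
    // Hypothetical layout for an already-built generalized suffix tree:
    // each edge carries its label as a String, and each leaf records which
    // input string (0 or 1) the suffix it represents came from.
    static class Node {
        Map<String, Node> children = new HashMap<>(); // edge label -> child
        int leafStringId = -1;                        // -1 on internal nodes
    }

    private String best = "";

    public String longestCommonSubstring(Node root) {
        best = "";
        mark(root, "");
        return best;
    }

    // Post-order DFS. Returns a 2-bit mask saying which input strings have
    // a suffix ending somewhere in this subtree; 'path' is the node's path
    // label (the concatenation of edge labels from the root down to it).
    private int mark(Node v, String path) {
        int mask = (v.leafStringId >= 0) ? (1 << v.leafStringId) : 0;
        for (Map.Entry<String, Node> e : v.children.entrySet())
            mask |= mark(e.getValue(), path + e.getKey());
        // Marked for both strings => the path label is a common substring;
        // the deepest such node gives the longest one.
        if (mask == 0b11 && path.length() > best.length())
            best = path;
        return mask;
    }
}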
