
In the past few days I've researched this extensively and read so many things that I am now more confused than ever. How does one find the longest common substring in a large data set? The idea is to remove duplicate content from this data set (of varying lengths, so the algorithm will need to run continuously). By large data set I mean approximately 100 MB of text.

Suffix tree? Suffix array? Rabin-Karp? What's the best way? And is there a library out there that can help me?

Really hoping for a good response; my head hurts a lot. Thank you! :-)

diffuse

1 Answer


I've always used suffix arrays for this, because I've been told it's the fastest way.
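Roughly, the idea looks like this (a quick Python sketch, not from any library; it builds the array naively by sorting whole suffixes, which is fine for a small test but far too slow and memory-hungry for 100 MB, where you'd want a real construction algorithm such as SA-IS). For the single-file case, the longest duplicated passage is the longest common prefix of two suffixes that end up next to each other in the sorted array:

# Naive sketch: longest repeated substring via a suffix array.
# Sorting whole suffix slices is O(n^2 log n) time and O(n^2) memory,
# so this only illustrates the idea; use a proper suffix-array
# construction (SA-IS, DC3, ...) for anything near 100 MB.
def longest_repeated_substring(text):
    n = len(text)
    # Suffix array: starting positions of all suffixes, in sorted order.
    sa = sorted(range(n), key=lambda i: text[i:])

    def common_prefix_len(a, b):
        k = 0
        while a + k < n and b + k < n and text[a + k] == text[b + k]:
            k += 1
        return k

    best_len, best_start = 0, 0
    # The longest repeat is the longest common prefix of two suffixes
    # that are adjacent in the sorted order.
    for i in range(1, n):
        k = common_prefix_len(sa[i - 1], sa[i])
        if k > best_len:
            best_len, best_start = k, sa[i]
    return text[best_start:best_start + best_len]

print(longest_repeated_substring("banana"))  # -> "ana"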

If the machine running the algorithm runs out of memory, you can always save your array to a file on the hard drive. It will slow the algorithm down considerably, but at least it will produce the result.

And I don't think a library will do a better job than a well-written, clean algorithm.

LE: Btw, you don't need to remove any data in order to find the longest common substring.

From the Longest Common Substring Problem:

function LCSubstr(S[1..m], T[1..n])
    L := array(1..m, 1..n)
    z := 0
    ret := {}
    for i := 1..m
        for j := 1..n
            if S[i] = T[j]
                if i = 1 or j = 1
                    L[i,j] := 1
                else
                    L[i,j] := L[i-1,j-1] + 1
                if L[i,j] > z
                    z := L[i,j]
                    ret := {}
                if L[i,j] = z
                    ret := ret ∪ {S[i-z+1..i]}
            else
                L[i,j] := 0
    return ret

You don't need to sort anything; you only have to parse your 100 MB of data once and build an n*m table to store the intermediate results. Also check this page
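If it helps, the same DP in Python looks roughly like this (just a sketch, not tuned for 100 MB; it keeps only the previous row of the table, so memory is O(n) instead of O(n*m)):

# Longest common substrings of s and t, same DP as the pseudocode above,
# but storing only the previous row of the table.
def lcs_substr(s, t):
    m, n = len(s), len(t)
    prev = [0] * (n + 1)   # row i-1 of the table
    z = 0                  # length of the longest match found so far
    ret = set()            # all longest common substrings
    for i in range(1, m + 1):
        cur = [0] * (n + 1)
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > z:
                    z = cur[j]
                    ret = set()
                if cur[j] == z:
                    ret.add(s[i - z:i])
        prev = cur
    return ret

print(lcs_substr("academy", "abracadabra"))  # -> {'acad'}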

LE: Rabin-Karp is a pattern-matching algorithm; you don't need it here.
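LE: And if what you really want is to find text that repeats inside a single file, you can run the same DP of the string against itself, only looking at cells with j > i so a position never matches itself (again just a sketch; overlapping repeats are allowed here):

# Longest substring that occurs at least twice in a single text.
# Same DP as above, but comparing the text with itself and skipping
# the diagonal (only j > i).
def longest_repeat(s):
    n = len(s)
    prev = [0] * (n + 1)
    best_len, best_end = 0, 0
    for i in range(1, n + 1):
        cur = [0] * (n + 1)
        for j in range(i + 1, n + 1):    # only j > i
            if s[i - 1] == s[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return s[best_end - best_len:best_end]

print(longest_repeat("abcxyzabc"))  # -> "abc"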

sdadffdfd
  • Can you provide me with some sample code or point to resources? I just figured that sorting an array of 100mb items would take a really long time, maybe I'm wrong. – diffuse Nov 17 '10 at 20:56
  • The article above is perfect.. maximum complexity is O(nm) where n and m are the lengths of the strings compared.. I don't think there is a faster way of doing it. – sdadffdfd Nov 17 '10 at 21:06
  • It sounds like the question is about removing duplicate bits of text in a single file (I think), in which case you will want `for j := i+1..n`. Also, definitely only store the last and current rows, since otherwise `L` would be about 1e16 in size! – Jeffrey L Whitledge Nov 17 '10 at 21:27