In the past few days I've researched this extensively; I've read so much that I'm now more confused than ever. How does one find the longest common substring in a large data set? The idea is to remove duplicate content from this data set (the duplicates are of varying lengths, so the algorithm will need to run continuously). By large data set I mean approximately 100 MB of text.
Suffix tree? Suffix array? Rabin-Karp? What's the best way? And is there a library out there that can help me?
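For context, here's the naive dynamic-programming version I understand so far (a minimal sketch, names my own); it's O(n*m) in time, which is exactly why it seems hopeless at 100 MB and why I'm asking about suffix structures:

```python
def longest_common_substring(a: str, b: str) -> str:
    # Classic DP: dp[j] = length of the common suffix ending at a[i-1], b[j-1].
    # Quadratic time, so fine for short strings but not for 100 MB of text.
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]

# longest_common_substring("abcdef", "zcdem") returns "cde"
```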
Really hoping for a good answer; my head hurts a lot. Thank you! :-)