Find the longest repeated substring in a group of strings

Question

I'm trying to find a way to find the largest duplicate substring in a group of strings. The longest duplicate substring problem usually applies to a single string, instead of a group of strings. What type of algorithm would be useful for finding the largest duplicate substring in a group of strings?

Finding the largest duplicate string in a group of files (in order to remove duplicate code in large software libraries) is the main use case that I have in mind, but there would be many other use cases for this algorithm as well.

For example, I'd want to find the longest duplicate substring in this group of strings:

"Hello world, this is the first string."
"Hello to the world, this is the second string."
"Hello world.  This is the third string."
"This is the third string."

In this case, "This is the third string." would be the longest repeated string (i. e., the longest string that appears in more than one of these strings).

One possible approach would be to generate a separator for each of the strings, and concatenate each string into one string, with the separator being in between each string. The separator would need to be a string that was not found in any of the existing strings. Then I could use the same algorithm that is used to find the longest duplicate substring for a single string. — Anderson Green, Mar 06 '13 at 21:48
@Andy Why, hello there! Having fun in SMC, are we? ;) Anyway, I'm pretty sure that if you just concatenate the strings and then apply the original algorithm, you may have better luck. — Sean Allred, Mar 06 '13 at 21:59
Although, you may want to tokenize your input first, as to not be looking character-by-character. Depending on how you implement it on that end, it could speed up the overall implementation by quite a bit. — Sean Allred, Mar 06 '13 at 22:00
actually `his is the ` is the longest substring present in all the strings followed by `string.` — CSᵠ, Mar 06 '13 at 23:29
@kaᵠ I'm looking for the longest substring that is repeated in at least two of the strings, not the longest substring that is repeated in all of the strings. `"This is the third string."` is the longest substring that is repeated more than once. — Anderson Green, Mar 07 '13 at 19:32
I think I've finally found a duplicate of this question: http://stackoverflow.com/questions/1410822/how-can-i-detect-common-substrings-in-a-list-of-strings — Anderson Green, Jun 04 '13 at 07:57

MihaiB · Answer 1 · 2013-03-06T22:43:52.470

0

Perhaps this is what you are looking for, but you will need to apply the algorithm for more than two strings. Which is not that hard, if you think about it. Also, take a look at this page. Using backtracking wouldn't be such a good idea.

edited Mar 06 '13 at 22:43

answered Mar 06 '13 at 22:38

MihaiB

129
1
4
17

I'm a bit confused - what type of backtracking are you referring to? – Anderson Green Mar 07 '13 at 04:13
@AndersonGreen You can do it with backtracking, for example the recursive one. But it would be probably the most inefficient thing you can do, if you had many strings or long ones, you could go to sleep and hope it will finish until you wake up. Take a look at LCS and LC string problem, in the links. Should be helpful. Also, simplecoder's Suffix Array DataStructure will help you – MihaiB Mar 07 '13 at 08:01

score 0 · Answer 2 · answered May 27 '13 at 00:52

0

The answer to your question is starts from slides 60 Click here

Basically, we list all the possible sufixes of the string input(Linear time). Sort them(NLogN), and find the longest one by going through the sorted list(Linear time)

answered May 27 '13 at 00:52

checai

836
4
11
17

score 0 · Answer 3 · answered May 27 '13 at 02:11

Create a Trie data structure (a.k.a. a prefix tree) for each string
- Let's call it T(i) for string i
Create an empty hash map with key string and value int
- Let's call it M
For each Trie T(i), for each node P (where P is a prefix string) in T(i),
- if key P is already in M, then increment M[P]
- else, insert M[P] = 1
Find the (P*,C*) pair in M such that:
- C* >= 2 (*)
- length(P*) is maximum among all such pairs
P* is the string that you want

(*) If you wanted to get the longest substring common to K of the strings, you would replace the 2 with K

Find the longest repeated substring in a group of strings

3 Answers3