Find repeating substring of Length N

Question

I have to make a Java program which finds all repeating sub-strings of length n in a given String. The input is string is extremely long and a brute-force approach takes too much time.

I alread tried:
Presently I am finding each sub-string separately and checking for repetitions of that sub-string using the KMP alogrithm. This too is taking too much time.

What is a more efficient approach for this problem?

Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it. — Prophet, Jan 04 '15 at 10:40
Not sure why this question is voted off as "too broad" - there is a specific question at hand, and @Program_Dude also provided what he has tried and why it failed. — amit, Jan 04 '15 at 10:40
@Amit what he did? he asked for some code? So it is off-topic — Prophet, Jan 04 '15 at 10:42
@Eliyahu It is completely on-topic. He described a problem, described a solution, explained why the suggested solution fails, and asked for an alternative. More "advanced" problem than "How to parse a string?" is perfectly on topic. — amit, Jan 04 '15 at 10:43
you have to share some code so if someone can answer you concern can guide you to specific point in the code, to enhance or alter, otherwise, anyone have to solve the whole question in order to present a sol. which is not the purpose of this website, thats why your question is voted `too broad` — Yazan, Jan 04 '15 at 10:44
@Amit especially after your editing where you deleted the problematic part of the question — Prophet, Jan 04 '15 at 10:44
i think **too broad There are either too many possible answers, or good answers would be too long for this format. Please add details to narrow the answer set or to isolate an issue that can be answered in a few paragraphs.** fits here, the OP not only asking for an answer, but OP asks for the best answer, a fast answer, which will be somehow off-topic — Yazan, Jan 04 '15 at 10:46
@Yazan Do you see code in these problems: [Pair socks from a pile](http://stackoverflow.com/questions/14415881/pair-socks-from-a-pile-efficiently), [Algorithm improvement for Coca-Cola can shape recognition](http://stackoverflow.com/questions/10168686/algorithm-improvement-for-coca-cola-can-shape-recognition) , [Plain English explanation of Big O](http://stackoverflow.com/q/487258/572670)? Yet they are perfectly on topic — amit, Jan 04 '15 at 10:46
@Eliyahu That's why we have editting powers, as more experienced users, we can easily make question more understandable and fit perfectly. These tools should be used before voting to close. — amit, Jan 04 '15 at 10:47
@amit i don't want to argue on this just for arguing, and it's good that you have brought some older questions, it's clear that these questions are way better and higher quality than this one. thats my opinion, i did not even downvote or flag this question, just saying :) — Yazan, Jan 04 '15 at 10:51
@Amit If so there should also be an option to remove the moderation flag by the reviewer. I saw this question before your editing, it was bad and I suggested removing / editing it. After your edit the question looks better so my flag will be disputed or even reclined. Do you think it's fair? — Prophet, Jan 04 '15 at 10:54

peter.petrov · Answer 1 · 2015-01-04T11:01:28.283

1) You should look at using a suffix tree data structure.

Suffix Tree

This data structure can be built in O(N * log N) time
(I think even in O(N) time using Ukkonen's algorithm)
where N is the size/length of the input string.
It then allows for solving many (otherwise) difficult
tasks in O(M) time where M is the size/length of the pattern.

So even though I didn't try your particular problem, I am pretty sure that
if you use a suffix tree and a smart formulation of your problem, then the
problem can be solved by using a suffix tree (in reasonable O time).

2) A very good book on these (and related) subjects is this one:

Algorithms on Strings, Trees and Sequences

It's not really easy to read though unless you're well-trained in algorithms.
But OK, reading such things is the only way to get well-trained ;)

3) I suggest you have a quick look at this algorithm too.

Aho-Corasick Algorithm

Even though, I am not sure but... this one might be somewhat
off-topic with respect to your particular problem.

the answer is quite useful. Thanks. – Rohith K Jan 04 '15 at 11:06 — Rohith K, Jan 04 '15 at 11:06

score 2 · Answer 2 · edited Jan 04 '15 at 12:45

I am going to take @peter.petrov's suggestion and enhance it by explaining how can one actually use a suffix tree to solve the problem:

 1. Create a suffix tree from the string, let it be `T`.
 2. Find all nodes of depth `n` in the tree, let that set of nodes be `S`. This can be done using DFS, for example.
 3. For each node `n` in `S`, do the following:
     3.1. Do a DFS, and count the number of terminals `n` leads to. Let this number be `count`
     3.2. If `count>1`, yield the substring that is related to `n` (the path from root to `n`), and `count`

Note that this algorithm treats any substring of length n and add it to the set S, and from there it search for how many times this was actually a substring by counting the number of terminals this substring leads to.

This means that the complexity of the problem is O(Creation + Traversal) - meaning, you first create the tree and then you traverse it (easy to see you don't traverse in steps 2-3 each node in the tree more than once). Since the traversal is obviously "faster" than creation of the tree - it leaves you with O(Creation), which as was pointed by @perer.petrov is O(|S|) or O(|S|log|S|) depending on your algorithm of choice.

Find repeating substring of Length N

2 Answers2