Given a long string, find the longest repeated sub-string.
The brute-force approach of course is to find all substrings and check the substrings of the remaining string, but the string(s) in question have millions of characters (like a DNA sequence, AGGCTAGCT etc) and I'd like something that finishes before the universe collapses in on itself.
Tried a number of approaches, and I have one solution that works quite fast on strings of up to several million, but takes literally forever (6+ hours) for larger strings, particularly when the length of the repeated sequence gets really long.
def find_lrs(text, cntr=2):
sol = (0, 0, 0)
del_list = ['01','01','01']
while len(del_list) != 0:
d = defaultdict(list)
for i in range(len(text)):
d[text[i:i + cntr]].append(i)
del_list = [(item, d[item]) for item in d if len(d[item]) > 1]
# if list is empty, we're done
if len(del_list) == 0:
return sol
else:
sol = (del_list[0][1][0], (del_list[0][1][1]),len(del_list[0][0]))
cntr += 1
return sol
I know it's ugly, but hey, I'm a beginner, and I'm just happy I got something to work. Idea is to go through the string starting out with length-2 substrings as the keys, and the index the substring is at the value. If the text was, say, 'BANANA', after the first pass through, the dict would look like this:
{'BA': [0], 'AN': [1, 3], 'NA': [2, 4], 'A': [5]}
BA shows up only once - starting at index 0. AN and NA show up twice, showing up at index 1/3 and 2/4, respectively.
I then create a list that only includes keys that showed up at least twice. In the example above, we can remove BA, since it only showed up once - if there's no substring of length 2 starting out with 'BA', there won't be an substring of length 3 starting with BA. So after the first past through the pruned list is: [('AN', [1, 3]), ('NA', [2, 4])]
Since there is at least two possibilities, we save the longest substring and indices found so far and increment the substring length to 3. We continue until no substring was repeated.
As noted, this works on strings up to 10 million in about 2 minutes, which apparently is reasonable - BUT, that's with the longest repeated sequence being fairly short. On a shorter string but longer repeated sequence, it takes -hours- to run. I suspect that it has something to do with how big the dictionary gets, but not quite sure why.
What I'd like to do of course is keep the dictionary short by removing the substrings that clearly aren't repeated, but I can't delete items from the dict while iterating over it. I know there are suffix tree approaches and such that - for now - are outside my ken.
Could simply be that this is beyond my current knowledge, which of course is fine, but I can't help shaking the idea that there is a solution here.