Longest repeated substring in massive string

Question

Given a long string, find the longest repeated sub-string.

The brute-force approach of course is to find all substrings and check the substrings of the remaining string, but the string(s) in question have millions of characters (like a DNA sequence, AGGCTAGCT etc) and I'd like something that finishes before the universe collapses in on itself.

I've tried a number of approaches, and I have one solution that works quite fast on strings of up to several million characters, but takes literally forever (6+ hours) for larger strings, particularly when the length of the repeated sequence gets really long.

from collections import defaultdict

def find_lrs(text, cntr=2):
    sol = (0, 0, 0)
    del_list = ['01','01','01']
    
    while len(del_list) != 0:
        d = defaultdict(list)
        
        for i in range(len(text)):
            d[text[i:i + cntr]].append(i)
        
        del_list = [(item, d[item]) for item in d if len(d[item]) > 1]

        # if list is empty, we're done
        if len(del_list) == 0:
            return sol
        else:
            sol = (del_list[0][1][0], (del_list[0][1][1]),len(del_list[0][0]))
        cntr += 1

    return sol

I know it's ugly, but hey, I'm a beginner, and I'm just happy I got something to work. The idea is to go through the string, starting with length-2 substrings as the keys and the indices where each substring starts as the values. If the text were, say, 'BANANA', after the first pass through, the dict would look like this:

{'BA': [0], 'AN': [1, 3], 'NA': [2, 4], 'A': [5]}

BA shows up only once - starting at index 0. AN and NA show up twice, showing up at index 1/3 and 2/4, respectively.
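
A minimal sketch of that first pass, pulled out on its own. The helper name index_substrings is hypothetical (it is not part of the code above), and it keeps the same slicing, so the trailing 'A' is just the truncated slice at the end of the string:

```python
from collections import defaultdict

def index_substrings(text, length):
    """Map each slice text[i:i + length] to the list of indices where it starts."""
    positions = defaultdict(list)
    for i in range(len(text)):
        positions[text[i:i + length]].append(i)
    return dict(positions)

first_pass = index_substrings("BANANA", 2)
print(first_pass)
# {'BA': [0], 'AN': [1, 3], 'NA': [2, 4], 'A': [5]}

# keep only the substrings that start at two or more positions
repeated = [(key, idx) for key, idx in first_pass.items() if len(idx) > 1]
print(repeated)
# [('AN', [1, 3]), ('NA', [2, 4])]
```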

I then create a list that only includes keys that showed up at least twice. In the example above, we can remove BA, since it only showed up once: if there's no repeated substring of length 2 starting with 'BA', there won't be a repeated substring of length 3 starting with BA. So after the first pass through, the pruned list is: [('AN', [1, 3]), ('NA', [2, 4])]

Since there are at least two possibilities, we save the longest substring and indices found so far and increment the substring length to 3. We continue until no substring is repeated.

As noted, this works on strings up to 10 million in about 2 minutes, which apparently is reasonable - BUT, that's with the longest repeated sequence being fairly short. On a shorter string but longer repeated sequence, it takes -hours- to run. I suspect that it has something to do with how big the dictionary gets, but not quite sure why.

What I'd like to do of course is keep the dictionary short by removing the substrings that clearly aren't repeated, but I can't delete items from the dict while iterating over it. I know there are suffix tree approaches and such that - for now - are outside my ken.

It could simply be that this is beyond my current knowledge, which of course is fine, but I can't shake the idea that there is a solution here.

Can the substring overlap itself, for example: 'ANA' has two occurrences in 'BANANA' — Dani Mesejo, Nov 27 '20 at 09:10
This seems to be a well-researched problem in Computer Science: https://en.wikipedia.org/wiki/Longest_repeated_substring_problem — piterbarg, Nov 27 '20 at 09:11
It feels like you could reduce the searching space by tracking where the substrings occur. If you find a repeated substring of length N on 3 places, you need only check the N+1 length strings at those 3 places against each other, right? It seems to me they can't occur anywhere else. Or am I mistaken? — Kraay89, Nov 27 '20 at 09:33
Yes, the substring can overlap. The longest substring in BANANA is ANA, length 3, starting at [1] and [3]. — Dragon-Ash, Nov 27 '20 at 09:39
Does this answer your question? [Find longest repetitive sequence in a string](https://stackoverflow.com/questions/11090289/find-longest-repetitive-sequence-in-a-string) — mkrieger1, Nov 27 '20 at 09:51
Kraay89 - yes, I think that is correct - if we've found all length-N substrings, we only need to check N+1 for N-length substrings that showed up at least twice. That's where I'm struggling - how can I eliminate keys that only show up once or otherwise skip those when I move to check N+1? — Dragon-Ash, Nov 27 '20 at 11:18

score 1 · Answer 1 · answered Dec 05 '20 at 16:27

I forgot to update this. After going over my code again, away from my PC - literally writing out little diagrams on my iPad - I realized that the code above wasn't doing what I thought it was doing.

As noted above, my plan of attack was to go through the string, starting with length-2 substrings as the keys and the indices where each substring starts as the values, create a list that captures only the length-2 substrings that occurred at least twice, and then only look at those locations.

All well and good - but look closely and you'll realize that I'm never actually updating the default dictionary to only have locations with two or more repeats! //bangs head against table.

I ultimately came up with two solutions. The first solution used a slightly different approach, the 'sorted suffixes' approach. This gets all the suffixes of the word, then sorts them in alphabetical order. For example, the suffixes of "BANANA", sorted, would be: A, ANA, ANANA, BANANA, NA, NANA

We then look at each adjacent suffix and find how many letters each pair start out having in common. A and ANA have only 'A' in common. ANA and ANANA have "ANA" in common, so we have length 3 as the longest repeated substring. ANANA and BANANA have nothing in common at the start, ditto BANANA and NA. NA and NANA have "NA" in common. So 'ANA', length 3, is the longest repeated substring.
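
The walk-through above can be sketched in a few lines. This is only an illustration: the function name adjacent_common_prefixes is hypothetical, and it uses the standard library's os.path.commonprefix, which compares strings character by character, in place of a hand-written comparison.

```python
import os.path

def adjacent_common_prefixes(text):
    """Sort all suffixes and pair each neighbour with the prefix they share."""
    suffixes = sorted(text[i:] for i in range(len(text)))
    rows = []
    for left, right in zip(suffixes, suffixes[1:]):
        # commonprefix works on any list of strings, not just paths
        rows.append((left, right, os.path.commonprefix([left, right])))
    return rows

for left, right, shared in adjacent_common_prefixes("BANANA"):
    print(f"{left!r:10} {right!r:10} -> {shared!r}")

# the longest shared prefix among neighbours is the longest repeated substring
longest = max((row[2] for row in adjacent_common_prefixes("BANANA")), key=len)
print(longest)  # ANA
```

Note that slicing every suffix copies it, so a string of n characters needs roughly n²/2 characters of memory here. That is fine for short inputs but likely too much for strings with millions of characters.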

I made a little helper function to do the actual comparing. The code looks like this:

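def longest_prefix(suf1, suf2):
    # compare character by character, up to the length of the shorter suffix
    min_len = min(len(suf1), len(suf2))
    for i in range(min_len):
        if suf1[i] != suf2[i]:
            # first mismatch at index i: the common prefix is suf1[:i]
            return (suf1[:i], i)
    # no mismatch: the shorter suffix is itself the common prefix
    return (suf1[:min_len], min_len)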


def longest_repeat(txt):
    # sort all suffixes so that repeated substrings end up next to each other
    lst = sorted([txt[i:] for i in range(len(txt))])
    mxLen = 0
    mx_string = ""
    for x in range(len(lst) - 1):
        temp = longest_prefix(lst[x], lst[x + 1])
        if temp[1] > mxLen:
            mxLen = temp[1]
            mx_string = temp[0]
    # report the first and last positions of the winning substring
    first = txt.find(mx_string)
    last = txt.rfind(mx_string)
    return (first, last, mxLen)

This works. I then went back, looked at my original code again, and saw that I wasn't resetting the dictionary. The key is that after each pass through I update the dictionary to -only- look at repeat candidates.

from collections import defaultdict

def longest_repeat(text):
    cntr = 2  # size of initial substring length we look for

    # create the initial dictionary: every length-2 substring -> start indices
    d = defaultdict(list)
    for i in range(len(text) - cntr + 1):
        d[text[i:i + cntr]].append(i)

    # keep only the index lists of substrings that occur at least twice
    del_list = [d[item] for item in d if len(d[item]) > 1]
    sol = (0, 0, 0)

    # keep looking as long as del_list isn't empty
    while len(del_list) > 0:
        # candidates exist at this length, so record the solution so far
        sol = (del_list[0][0], del_list[0][1], cntr)
        d = defaultdict(list)  # reset dictionary
        cntr += 1  # increment search length
        for item in del_list:
            for i in item:
                # only extend candidates that still fit inside the text
                if i + cntr <= len(text):
                    d[text[i:i + cntr]].append(i)
        # filter as above
        del_list = [d[item] for item in d if len(d[item]) > 1]
    return sol

This was quite fast, and I think it's easier to follow.
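
One way to gain some confidence in either version is to compare it against a brute-force reference on short strings. The sketch below is only that: the name brute_force_lrs is hypothetical, it allows overlapping occurrences (as in 'BANANA'), and it is far too slow for anything but small test inputs.

```python
def brute_force_lrs(text):
    """Return the longest substring that occurs at least twice (overlaps allowed)."""
    n = len(text)
    for length in range(n - 1, 0, -1):  # try the longest candidates first
        seen = set()
        for i in range(n - length + 1):
            piece = text[i:i + length]
            if piece in seen:
                return piece  # first repeat found at this length
            seen.add(piece)
    return ""  # no substring occurs twice

print(brute_force_lrs("BANANA"))  # ANA
```

The length of the returned string can then be checked against the third element of the tuple a faster implementation returns for the same input.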
