42

I need to find the longest sequence in a string with the caveat that the sequence must be repeated three or more times. So, for example, if my string is:

fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld

then I would like the value "helloworld" to be returned.

I know of a few ways of accomplishing this but the problem I'm facing is that the actual string is absurdly large so I'm really looking for a method that can do it in a timely fashion.

djechlin
  • 59,258
  • 35
  • 162
  • 290
Snesticle
  • 935
  • 2
  • 8
  • 15
  • 7
    I don't know if there is a regex solution to this problem. This _can't_ be a regular expression, but python might have a non regular extension that does something like this. In the general case this is the LCS problem, which can be solved using dynamic programming: http://en.wikipedia.org/wiki/Longest_common_subsequence_problem – Kristopher Micinski Jun 18 '12 at 20:14

7 Answers7

33

This problem is a variant of the longest repeated substring problem and there is an O(n)-time algorithm for solving it that uses suffix trees. The idea (as suggested by Wikipedia) is to construct a suffix tree (time O(n)), annotate all the nodes in the tree with the number of descendants (time O(n) using a DFS), and then to find the deepest node in the tree with at least three descendants (time O(n) using a DFS). This overall algorithm takes time O(n).

That said, suffix trees are notoriously hard to construct, so you would probably want to find a Python library that implements suffix trees for you before attempting this implementation. A quick Google search turns up this library, though I'm not sure whether this is a good implementation.

Another option would be to use suffix arrays in conjunction with LCP arrays. You can iterate over pairs of adjacent elements in the LCP array, taking the minimum of each pair, and store the largest number you find this way. That will correspond to the length of the longest string that repeats at least three times, and from there you can then read off the string itself.

There are several simple algorithms for building suffix arrays (the Manber-Myers algorithm runs in time O(n log n) and isn't too hard to code up), and Kasai's algorithm builds LCP arrays in time O(n) and is fairly straightforward to code up.

Hope this helps!

templatetypedef
  • 362,284
  • 104
  • 897
  • 1,065
  • 1
    “notoriously hard to construct” – say what? – Konrad Rudolph Jun 18 '12 at 20:20
  • 9
    @KonradRudolph- I don't know of any "simple" algorithms for constructing suffix trees in linear time. The two algorithms I do know (Ukkonen's algorithm and the DC3 algorithm) are extremely complicated and do not have obvious proofs of correctness. That said, if I'm mistaken about this, I would love to stand corrected! – templatetypedef Jun 18 '12 at 20:22
  • I agree about the proofs not being trivial. But there exist pseudo-codes for Ukkonen’s algorithm which are straightforward to adapt. Furthermore, while linear time algorithms are hard to find, there are trivially derived non-linear construction algorithms which nevertheless perform quite well in practice. – Konrad Rudolph Jun 18 '12 at 20:25
  • 3
    This is the first time I've seen anyone post a helpful / non-referential answer to any variant of LCS. Thanks for the library link. – Profane Dec 24 '12 at 06:11
  • @templatetypedef The Ukkonnen's algorithm is (IMHO) well-explained in the [original paper](https://link.springer.com/article/10.1007/BF01206331). You first have to understand that how the algorithms works for tries (one character per edge). Intuitively, the active points correspond to nodes where the tree can grow, i.e. the nodes labeled by the last processed character. The Ukkonnen algorithm uses various tricks (suffix inclusions, etc.) to find the set of current active nodes more efficiently. The suffix tree version uses pair of indices to save memory and ease tree updates. – mando Jan 26 '21 at 15:54
12

Use defaultdict to tally each substring beginning with each position in the input string. The OP wasn't clear whether overlapping matches should or shouldn't be included, this brute force method includes them.

from collections import defaultdict

def getsubs(loc, s):
    substr = s[loc:]
    i = -1
    while(substr):
        yield substr
        substr = s[loc:i]
        i -= 1

def longestRepetitiveSubstring(r, minocc=3):
    occ = defaultdict(int)
    # tally all occurrences of all substrings
    for i in range(len(r)):
        for sub in getsubs(i,r):
            occ[sub] += 1

    # filter out all substrings with fewer than minocc occurrences
    occ_minocc = [k for k,v in occ.items() if v >= minocc]

    if occ_minocc:
        maxkey =  max(occ_minocc, key=len)
        return maxkey, occ[maxkey]
    else:
        raise ValueError("no repetitions of any substring of '%s' with %d or more occurrences" % (r,minocc))

prints:

('helloworld', 3)
PaulMcG
  • 62,419
  • 16
  • 94
  • 130
  • 3
    I truly love this solution but unfortunately my strings are typically just too big. I'd wager, however, that your answer will be very useful to a number of people landing here through Google, as it does solve the original example I gave quite nicely. – Snesticle Jun 19 '12 at 13:35
4

Let's start from the end, count the frequency and stop as soon as the most frequent element appears 3 or more times.

from collections import Counter
a='fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
times=3
for n in range(1,len(a)/times+1)[::-1]:
    substrings=[a[i:i+n] for i in range(len(a)-n+1)]
    freqs=Counter(substrings)
    if freqs.most_common(1)[0][1]>=3:
        seq=freqs.most_common(1)[0][0]
        break
print "sequence '%s' of length %s occurs %s or more times"%(seq,n,times)

Result:

>>> sequence 'helloworld' of length 10 occurs 3 or more times

Edit: if you have the feeling that you're dealing with random input and the common substring should be of small length, you better start (if you need the speed) with small substrings and stop when you can't find any that appear at least 3 time:

from collections import Counter
a='fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
times=3
for n in range(1,len(a)/times+1):
    substrings=[a[i:i+n] for i in range(len(a)-n+1)]
    freqs=Counter(substrings)
    if freqs.most_common(1)[0][1]<3:
        n-=1
        break
    else:
        seq=freqs.most_common(1)[0][0]
print "sequence '%s' of length %s occurs %s or more times"%(seq,n,times) 

The same result as above.

Max Li
  • 5,069
  • 3
  • 23
  • 35
  • Which is to say, build each set of substrings M items long, starting with `M=len(a)/N` (where N is the minimum number of reps), and count the reps of each. If no substring has number of occurrences >= N, then substract 1 from M and try again. Yes? – PaulMcG Jun 18 '12 at 21:44
  • I'm attempting something similar in the opposite direction (small to large). Any idea what the time complexity is for your approach? – Matt Coughlin Jun 18 '12 at 21:48
  • @MattCoughlin the worst case the number of substrings looks at the 1st sight quadratic: it's sum(len(a)-([len(a)/times] - i)+1) where i runs from 0 to [len(a)/K] which is O(len(a)^2). Then, you build frequency on each substring which is linear on substring. So, it might be O(len(a)^3) at the end, please correct me if I'm wrong. You can spare some operations on Counter, since you don't need the whole distribution, but only to know that the max freq is >=3. If you know that you're dealing with random input, you better start with small strings. Thus, in practice it must be pretty fast – Max Li Jun 18 '12 at 22:06
1

The first idea that came to mind is searching with progressively larger regular expressions:

import re

text = 'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
largest = ''
i = 1

while 1:
    m = re.search("(" + ("\w" * i) + ").*\\1.*\\1", text)
    if not m:
        break
    largest = m.group(1)
    i += 1

print largest    # helloworld

The code ran successfully. The time complexity appears to be at least O(n^2).

Matt Coughlin
  • 18,666
  • 3
  • 46
  • 59
0

If you reverse the input string, then feed it to a regex like (.+)(?:.*\1){2}
It should give you the longest string repeated 3 times. (Reverse capture group 1 for the answer)

Edit:
I have to say cancel this way. It's dependent on the first match. Unless its tested against a curr length vs max length so far, in an itterative loop, regex won't work for this.

0

In Python you can use the string count method. We also use an additional generator which will generate all the unique substrings of a given length for our example string.

The code is straightforward:

test_string2 = 'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'

def generate_substrings_of_length(this_string, length):
    ''' Generates unique substrings of a given length for a given string'''
    for i in range(len(this_string)-2*length+1):
        yield this_string[i:i+length]

def longest_substring(this_string):
    '''Returns the string with at least two repetitions which has maximum length'''
    max_substring = ''
    for subs_length in range(2, len(this_string) // 2 + 1):
        for substring in generate_substrings_of_length(this_string, subs_length):
            count_occurences = this_string.count(substring)
            if count_occurences > 1 :
                if len(substring) > len(max_substring) :
                    max_substring = substring
    return max_substring

I must note here (and this is important) that the generate_substrings_of_length generator does not generate all the substrings of a certain length. It will generate only the required substring to be able to make comparisons. Otherwise we will have some artificial duplicates. For example in the case :

test_string = "banana"

GS = generate_substrings_of_length(test_string , 2)
for i in GS: print(i)

will result :

ba
an
na

and this is enough for what we need.

btrif
  • 307
  • 5
  • 13
-2
from collections import Counter

def Longest(string):

    b = []
    le = []

    for i in set(string):

        for j in range(Counter(string)[i]+1): 
            b.append(i* (j+1))

    for i in b:
        if i in string:
            le.append(i)


    return ([s for s in le if len(s)==len(max( le , key = len))])
xskxzr
  • 12,442
  • 12
  • 37
  • 77