12

I am looking for an algorithm (possibly implemented in Python) able to find the most REPETITIVE sequence in a string. Where for REPETITIVE, I mean any combination of chars that is repeated over and over without interruption (tandem repeat).

The algorithm I am looking for is not the same as the "find the most common word" one. In fact, the repetitive block doesn't need to be the most common word (substring) in the string.

For example:

s = 'asdfewfUBAUBAUBAUBAUBAasdkBAjnfBAenBAcs'
> f(s)
'UBAUBAUBAUBAUBA' #the "most common word" algo would return 'BA'

Unfortunately, I have no idea on how to tackle this. Any help is very welcome.


UPDATE

A little extra example to clarify that I want to be returned the sequence with the most number of repetition, whatever its basic building block is.

g = 'some noisy spacer'
s = g + 'AB'*5 + g + '_ABCDEF'*2 + g + 'AB'*3
> f(s)
'ABABABABAB' #the one with the most repetitions, not the max len

Examples from @rici:

s = 'aaabcabc'
> f(s)
'abcabc'

s = 'ababcababc'
> f(s)
'ababcababc' #'abab' would also be a solution here
             # since it is repeated 2 times in a row as 'ababcababc'.
             # The proper algorithm would return both solutions.
Andronicus
  • 25,419
  • 17
  • 47
  • 88
alec_djinn
  • 10,104
  • 8
  • 46
  • 71

4 Answers4

13

With combination of re.findall() (using specific regex patten) and max() functions:

import re

#  extended sample string
s = 'asdfewfUBAUBAUBAUBAUBAasdkjnfencsADADADAD sometext'

def find_longest_rep(s):
    result = max(re.findall(r'((\w+?)\2+)', s), key=lambda t: len(t[0]))
    return result[0]

print(find_longest_rep(s))

The output:

UBAUBAUBAUBAUBA

The crucial pattern:

  • ((\w+?)\2+):
    • (....) - the outermost captured group which is the 1st captured group
    • (\w+?) - any non-whitespace character sequence enclosed into the 2nd captured group; +? - quantifier, matches between one and unlimited times, as few times as possible, expanding as needed
    • \2+ - matches the same text as most recently matched by the 2nd capturing group
RomanPerekhrest
  • 88,541
  • 4
  • 65
  • 105
  • Could you please share the expected time complexity of the program? – Ashutosh Feb 19 '18 at 18:11
  • Wonderful solution! Thanks. I will test it with some examples before accepting the answer. – alec_djinn Feb 19 '18 at 18:15
  • Here are two examples of strings which fail: `longest_rep('aaabcabc') => 'aaa'` (should be `'abcabc'`); `longest_rep('ababcababc') => 'abab'` (should be `'ababcababc'`) – rici Feb 19 '18 at 18:56
  • @rici You are right, I have added a more meaningful example to my post. However, this solution is still very helpful `re.findall(r'((\w+?)\2+)', s)` finds all the repeated blocks, getting the one with more repetition is then a piece of cake. – alec_djinn Feb 20 '18 at 07:47
  • If nothing better pops up today, I feel this answer deserves to be accepted since it gave me a very useful suggestion on how to proceed with the implementation of the algorithm I am looking for. – alec_djinn Feb 20 '18 at 07:53
  • @alec_djinn: my point was that regex does *not* find all repeated blocks under certain circumstances. For example, on input `'aabcabcabc'`, which includes the thrice-repeated 'abcabcabc', findall returns `[('aa', 'a'), ('bcabca', 'bca')]`. Similarly, on `'ababcababcababc'`, which is three repetitions of `ababc`, it returns only blocks repeated twice: `[('abab', 'ab'), ('cababcabab', 'cabab')]`. – rici Feb 20 '18 at 14:39
  • @rici Thanks for your insight! Tha is indeed a problem. I have added your examples in the post. Do you have any solution to offer? – alec_djinn Feb 20 '18 at 15:09
  • @alec_djinn: I don't have a solution offhand, other than the obvious O(n^2) solution of looking at every substring. (That can be made more efficient by terminating the endpoint scan early; in practice it is probably fast enough unless your strings are very long.) But I suspect that an O(n) solution exists. – rici Feb 20 '18 at 19:08
7

Here is the solution based on ((\w+?)\2+) regex but with additional improvements:

import re
from itertools import chain


def repetitive(sequence, rep_min_len=1):
    """Find the most repetitive sequence in a string.

    :param str sequence: string for search
    :param int rep_min_len: minimal length of repetitive substring
    :return the most repetitive substring or None
    """
    greedy, non_greedy = re.compile(r'((\w+)\2+)'), re.compile(r'((\w+?)\2+)')

    all_rep_seach = lambda regex: \
        (regex.search(sequence[shift:]) for shift in range(len(sequence)))

    searched = list(
        res.groups()
        for res in chain(all_rep_seach(greedy), all_rep_seach(non_greedy))
        if res)

    if not sequence:
        return None

    cmp_key = lambda res: res[0].count(res[1]) if len(res[1]) >= rep_min_len else 0
    return max(searched, key=cmp_key)[0]

You can test it like so:

def check(seq, expected, rep_min_len=1):
    result = repetitive(seq, rep_min_len)
    print('%s => %s' % (seq, result))
    assert result == expected, expected


check('asdfewfUBAUBAUBAUBAUBAasdkBAjnfBAenBAcs', 'UBAUBAUBAUBAUBA')
check('some noisy spacerABABABABABsome noisy spacer_ABCDEF_ABCDEFsome noisy spacerABABAB', 'ABABABABAB')
check('aaabcabc', 'aaa')
check('aaabcabc', 'abcabc', rep_min_len=2)
check('ababcababc', 'ababcababc')
check('ababcababcababc', 'ababcababcababc')

Key features:

  1. used greedy ((\w+)\2+) and non-greedy ((\w+)\2+?) regex;
  2. search repetitive substring in all substrings with the shift from the beginning (e.g.'string' => ['string', 'tring', 'ring', 'ing', 'ng', 'g']);
  3. selection is based on the number of repetitions not on the length of subsequence (e.g. for 'ABABABAB_ABCDEF_ABCDEF' result will be 'ABABABAB', not '_ABCDEF_ABCDEF');
  4. the minimum length of a repeating sequence is matters (see 'aaabcabc' check).
Serhii Kostel
  • 176
  • 1
  • 3
4

What you are searching for is an algorithm to find the 'largest' primitive tandem repeat in a string. Here is a paper describing a linear time algorithm to find all tandem repeats in a string and by extension all primitive tandem repeats. Gusfield. Linear Time Algorithms for Finding and Representing all Tandem Repeats in a String

M. Jajeh
  • 140
  • 3
1

Here is a brute force algorithm that I wrote. Maybe it will be useful:

def find_most_repetitive_substring(string):
max_counter = 1
position, substring_length, times = 0, 0, 0
for i in range(len(string)):
    for j in range(len(string) - i):
        counter = 1
        if j == 0:
            continue
        while True:
            if string[i + counter * j: i + (counter + 1) * j] != string[i: i + j] or i + (counter + 1) * j > len(string):
                if counter > max_counter:
                    max_counter = counter
                    position, substring_length, times = i, j, counter
                break
            else:
                counter += 1
return string[position: position + substring_length * times]
Andronicus
  • 25,419
  • 17
  • 47
  • 88