A more complex version of "How can I tell if a string repeats itself in Python?"

Question

I was reading this post and I wonder if someone can find the way to catch repetitive motifs into a more complex string.

For example, find all the repetitive motifs in

string = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'

Here the repetitive motifs: 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'

So, the output should be something like this:

output = {'ACGT': {'repeat': 2,
                   'region': (5,13)},
          'GT': {'repeat': 3,
                 'region': (19,24)},
          'TATACG': {'repeat': 2,
                     'region': (29,40)}}

This example comes from a typical biological phenomena termed microsatellite which is present into the DNA.

UPDATE 1: Asterisks were removed from the string variable. It was a mistake.

UPDATE 2: Single character motif doesn't count. For example: in ACGUGAAAGUC, the 'A' motif is not taken into account.

I think they use something called `suffix tree` for this ... but its not super simple... and everytime i start doing it i just quit about half way https://www.cs.cmu.edu/~ckingsf/bioinfo-lectures/suffixtrees.pdf — Joran Beasley, Apr 29 '15 at 21:46
What do you count as *"repetitive motifs"*? If `'GT'` counts, why not e.g. `'**'`, `'TT'` or `'CC'`? And what exactly is your question? — jonrsharpe, Apr 29 '15 at 21:49
`'ACGT'` doesn't repeated 2 times ,`ACGTACGTA` has a `A` at end !! — Mazdak, Apr 29 '15 at 21:55
It repeats 2 times, but its entire block between `**` pairs isn't a totally repetitive string. Which definition is important here? — TigerhawkT3, Apr 29 '15 at 22:01
What is the desired output if one substring repeats itself in multiple "regions" of the string? — Shashank, Apr 30 '15 at 02:32

Mazdak · Accepted Answer · 2017-10-03T10:15:15.490

you can use a recursion function as following :

Note: The result argument will be treated as a global variable (because passing mutable object to the function affects the caller)

import re
def finder(st,past_ind=0,result=[]):
   m=re.search(r'(.+)\1+',st)
   if m:
      i,j=m.span()
      sub=st[i:j]
      ind = (sub+sub).find(sub, 1)
      sub=sub[:ind]
      if len(sub)>1:
        result.append([sub,(i+past_ind+1,j+past_ind+1)])
      past_ind+=j
      return finder(st[j:],past_ind)
   else:
      return result



s='AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
print finder(s)

result:

[['ACGT', (5, 13)], ['GT', (19, 25)], ['TATACG', (29, 41)]]

answer to previous question for following string :

s = 'AAAC**ACGTACGTA**ATTCC**GTGTGT**CCCC**TATACGTATACG**TTT'

You can use both answers from mentioned question and some extra recipes :

First you can split the string with ** then create a new list contain the repeated strings with r'(.+)\1+' regex :

So the result will be :

>>> new=[re.search(r'(.+)\1+',i).group(0) for i in s.split('**')]
>>> new
['AAA', 'ACGTACGT', 'TT', 'GTGTGT', 'CCCC', 'TATACGTATACG', 'TTT']

Note about 'ACGTACGT' that missed the A at the end!

Then you can use principal_period's function to get the repeated sub strings :

def principal_period(s):
    i = (s+s).find(s, 1, -1)
    return None if i == -1 else s[:i]

>>> for i in new:
...    p=principal_period(i)
...    if p is not None and len(p)>1:
...        l.append(p)
...        sub.append(i)
...

So you will have the repeated strings in l and main strings in sub :

>>> l
['ACGT', 'GT', 'TATACG']
>>> sub
['ACGTACGT', 'GTGTGT', 'TATACGTATACG']

Then you need a the region that you can do it with span method :

>>> for t in sub:
...    regons.append(re.search(t,s).span())

>>> regons
[(6, 14), (24, 30), (38, 50)]

And at last you can zip the 3 list regon,sub,l and use a dict comprehension to create the expected result :

>>> z=zip(sub,l,regons)
>>> out={i :{'repeat':i.count(j),'region':reg} for i,j,reg in z}
>>> out
{'TATACGTATACG': {'region': (38, 50), 'repeat': 2}, 'ACGTACGT': {'region': (6, 14), 'repeat': 2}, 'GTGTGT': {'region': (24, 30), 'repeat': 3}}

The main code :

>>> s = 'AAAC**ACGTACGTA**ATTCC**GTGTGT**CCCC**TATACGTATACG**TTT'
>>> sub=[]
>>> l=[]
>>> regon=[]
>>> new=[re.search(r'(.+)\1+',i).group(0) for i in s.split('**')]
>>> for i in new:
...    p=principal_period(i)
...    if p is not None and len(p)>1:
...        l.append(p)
...        sub.append(i)
... 

>>> for t in sub:
...    regons.append(re.search(t,s).span())
... 
>>> z=zip(sub,l,regons)
>>> out={i :{'repeat':i.count(j),'region':reg} for i,j,reg in z}
>>> out
{'TATACGTATACG': {'region': (38, 50), 'repeat': 2}, 'ACGTACGT': {'region': (6, 14), 'repeat': 2}, 'GTGTGT': {'region': (24, 30), 'repeat': 3}}

What is the complexity (big _O_ notation) of this solution ? — Sylvain Leroux, Apr 29 '15 at 22:22
My apologies. There are not asterisks into the string. It was a mistake. — Ivan Castro, Apr 29 '15 at 22:26
@SylvainLeroux I think the worst part is creating the `new` that is exponential! — Mazdak, Apr 29 '15 at 22:29
I don't think a regex solution would be exponential, probably it's polynomial time. — Shashank, Apr 30 '15 at 00:37
@Shashank Maybe! its based on the regex engine and the regular expression that you use! — Mazdak, Apr 30 '15 at 11:00
Hi @Kasramvd, I get a issue using your first proposal. When I execute finder several times, it comes with the solution of previous usage. For example, when I use finder the first time, I got [['ACG', (1,13)]]. And if I use it again with other string, I got [['ACG', (1,13)], ['CCG', (4,13)]]. Note that the result from the first usage comes here along the.result of the second.usage of finder. — Ivan Castro, Sep 23 '15 at 02:00

score 1 · Answer 2 · answered Apr 30 '15 at 02:11

If you can bound your query then you can use a single pass of the string. The number of comparisons will be length of string * (max_length - min_length) so will scale linearly.

s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'

def find_repeats(s, max_length, min_length=2):

    for i in xrange(len(s)):
        for j in xrange(min_length, max_length+1):
            count = 1
            while s[i:i+j] == s[i+j*count:i+j*count+j]: count += 1
            if count > 1:
                yield s[i:i+j], i, count

for pattern, position, count in find_repeats(s, 6, 2):
    print "%6s at region (%d, %d), %d repeats" % (pattern, position, position + count*len(pattern), count)

Output:

    AC at region (2, 6), 2 repeats
  ACGT at region (4, 12), 2 repeats
  CGTA at region (5, 13), 2 repeats
    GT at region (18, 24), 3 repeats
    TG at region (19, 23), 2 repeats
    GT at region (20, 24), 2 repeats
    CC at region (24, 28), 2 repeats
    TA at region (28, 32), 2 repeats
TATACG at region (28, 40), 2 repeats
ATACGT at region (29, 41), 2 repeats
    TA at region (34, 38), 2 repeats

Note that this catches a fair few more overlapping patterns than the regexp answers, but without knowing more about what you consider a good match it is difficult to reduce it further, for example why is TATACG better than ATACGT?

Extra: Using a dict to return matches is a bad idea as the patterns are not going to be unique.

chapelo · Answer 3 · 2015-04-30T01:32:10.837

0

This simple while loop detects all repeated patterns:

def expand():
    global hi
    hi += 1

def shrink():
    global lo
    lo += 1

s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'

motifs = set()

lo = 0
hi = 0

f = expand

while hi <= len(s):
    sub = s[lo : hi+1]
    if s.count(sub) > 1:
        motifs.add(sub)
        if lo==hi: f = expand
        f()
    else:
        f = shrink if lo<=hi else expand
        f()

At this point, motifs contains all the repeated patterns... Let's filter them with some criteria:

minlen = 3
for m in filter(lambda m: len(m)>=minlen and s.count(2*m)>=1, motifs):
    print(m)

'''
ATACGT
ACGT
TATACG
CGTA
'''

edited Apr 30 '15 at 01:32

answered Apr 30 '15 at 00:29

chapelo

2,519
13
19

I think you are on to something, but the issue is that count finds non-connected repetitions. e.g. `'hellobyehello'.count('hello') == 2` – Shashank Apr 30 '15 at 00:42
@Shashank, If you neet only connected repetitions, you can filter them once you have them all. See edited content. – chapelo Apr 30 '15 at 01:34

Shashank · Answer 4 · 2015-04-30T13:10:03.120

You can use the fact that in regex, lookaheads do not advance the primary iterator. Thus, you can nest a capture group within a lookahead to find the (potentially overlapping) patterns that repeat and have a specified minimum length:

>>> import re
>>> s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
>>> re.findall(r'(?=(.{2,})\1+)', s)
['AC', 'ACGT', 'CGTA', 'GT', 'TG', 'GT', 'CC', 'TATACG', 'ATACGT', 'TA']
>>> re.findall(r'(?=(.{2,}?)\1+)', s)
['AC', 'ACGT', 'CGTA', 'GT', 'TG', 'GT', 'CC', 'TA', 'ATACGT', 'TA']

Note the slightly different results between using a greedy and a non-greedy quantifier. The greedy quantifier searches for the longest repeating substring starting from every index in the original string, if one exists. The non-greedy quantifier searches for the shortest of the same. The limitation is that you can only get a maximum one pattern per starting index in the string. If you have any ideas to solve this problem, let me know! Potentially, we can use the greedy quantifier regex to set up a recursive solution that finds every repeating pattern starting from each index, but let's avoid "premature computation" for now.

Now if we take the regex (?=(.{2,})\1+) and modify it, we can also capture the entire substring that contains repeated motifs. By doing this, we can use the span of the matches to calculate the number of repetitions:

(?=((.{2,})\2+))

In the above regex, we have a capture group inside a capture group inside a lookahead. Now we have everything we need to solve the problem:

def repeated_motifs(s):
    import re
    from collections import defaultdict
    rmdict = defaultdict(list)
    for match in re.finditer(r'(?=((.{2,})\2+))', s):
        motif = match.group(2)
        span1, span2 = match.span(1), match.span(2)
        startindex = span1[0]
        repetitions = (span1[1] - startindex) // (span2[1] - startindex)
        others = rmdict[motif]
        if not others or startindex > others[-1]['region'][1]:
            others.append({'repeat': repetitions, 'region': span1})
    return rmdict


s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
d = repeated_motifs(s)
print(d)
# list of the repeating motifs we have found sorted by first region
print(sorted(list(d.keys()), key=lambda k: d[k][0]['region']))

Because desired behavior in the situation where a motif repeats itself in multiple "regions" of the string was not specified, I have made the assumption that OP would like a dictionary of string->list where each list contains its own set of dictionaries.

A more complex version of "How can I tell if a string repeats itself in Python?"

4 Answers4

answer to previous question for following string :

Linked