Counting the maximum amount a pattern appears consecutively in a string

Question

I really cannot figure out how to make my code count the maximum number of times a pattern appears in a row in a string. I have tried Google and so on, but none of the answers really match what I am looking for. Perhaps I am just googling the wrong thing. Anyway, here is my issue:

I have a long text file that contains random DNA sequences. I have converted it into a string, and in that string I am supposed to find certain DNA sequences and count the highest number of times each one appears consecutively. To better explain the problem, I'm pasting the code I am currently trying to use.

# Opening sequence.txt and making it to a string
seqfile = open(sequence, "r")
seqfile = seqfile.read().replace("\n", "")

# Regex for each STR
pattern1 = r"AGATC"
pattern2 = r"TTTTTTCT"
pattern3 = r"AATG"
pattern4 = r"TCTAG"
pattern5 = r"GATA"
pattern6 = r"TATC"
pattern7 = r"GAAA"
pattern8 = r"TCTG"

# 3 lists to store value for the loop. Whereas outercount is the final value of each amount of STR corresponding data list

outercount = [0, 0, 0, 0, 0, 0, 0, 0]
innercount = [0, 0, 0, 0, 0, 0, 0, 0]
secondcount = [0, 0, 0, 0, 0, 0, 0, 0]

# Looping through the sequence and checking if pattern matches, if it does update secondcounter by 1 and continue...
for i in seqfile:
    if re.match(pattern1, seqfile):
        secondcount[0] += 1
    elif re.match(pattern2, seqfile):
        secondcount[1] += 1
    elif re.match(pattern3, seqfile):
        secondcount[2] += 1
    elif re.match(pattern4, seqfile):
        secondcount[3] += 1
    elif re.match(pattern5, seqfile):
        secondcount[4] += 1
    elif re.match(pattern6, seqfile):
        secondcount[5] += 1
    elif re.match(pattern7, seqfile):
        secondcount[6] += 1
    elif re.match(pattern8, seqfile):
        secondcount[7] += 1

# Looping through outercount and checking if certain value at innercount is less than secondcount update values.
for i in outercount:
        if secondcount[i] > innercount[i]:
        #stop counting
        innercount[i] = secondcount[i]
    # Reset secondcounts value so that it doesn't continue counting if it is not consecutively
    secondcount[i] = 0
    # Checking if innercount is greater than outercount, if it is set outercount[i] to equal innercount[i] value
    if innercount[i] > outercount[i]:
        outercount[i] = innercount[i]

Here is an example of how the sequence text file might look:

TGGTTTAGGGCCTATAATTGCAGGACCACTGGCCCTTGTCGAGGTGTACAGGTAGGGAGCTAAGTTCGAAACGCCCCTTGGTCGGGATTACCGCCAGATCAGATC...

Mind you, the real file is much longer than this; the line above is just for reference. In this text I am supposed to find up to 8 different DNA sequences and the number of times each appears in a row. For example: look for the pattern AGATC, then count the highest number of times it appears in a row. If it appears 3 times in a row somewhere in the text and then 6 times in a row further down, my counter for AGATC should say 6, since that is the highest number in a row.

So, to explain my code: I had the idea of using 3 different lists, which I guess is not the most scalable solution, since there can be anywhere from 3 to 8 different patterns in the text. But I thought that starting with the largest number might make it easier to figure out the rest. What I tried to do was make a regex for each pattern, check whether each pattern could be found in the text, and if it could, update the corresponding index of the secondcount list.

Then, in another loop, I compare whether the value at secondcount[i] is greater than innercount[i], and if it is, copy that value to innercount. After that I reset secondcount[i], because presumably that is the end of the consecutive run, so if the pattern appears again later in the string, counting starts from 0 again, and so on. I guess the code is not so hard to understand, but well, it doesn't work, so... XD

Does anyone have some ideas on how I could implement this?

score 1 · Accepted Answer · answered Jan 19 '20 at 19:26

Assuming a pattern can occur multiple times in a row, I'd proceed as follows to calculate the max consecutive repetitions of a pattern in a sequence across all sequences.

import re

with open(sequence_file, 'rt') as f:
    rows = f.readlines()

patterns = { 
    re.compile("AGATC"): 0,
    re.compile("TCTAG"): 0,
    ... 
}

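for r in rows:
    for p in patterns:
        prev_end = 0  # end offset of the previous match
        freq = 0      # length of the current run of back-to-back matches
        for m in p.finditer(r):
            span = m.span()
            if span[0] != prev_end:
                # Gap since the previous match: record the finished run and start over
                patterns[p] = max(freq, patterns[p])
                freq = 0

            prev_end = span[1]
            freq += 1

        # Record the run that was still open at the end of the row
        if freq:
            patterns[p] = max(freq, patterns[p])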

Note: I haven't tested this code. So, please test it with known inputs before using it.
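
Since the note above asks for a check against known inputs, here is a minimal sketch of one. It wraps the same finditer idea in a helper and runs it on a short hand-made string. The helper name `longest_run` and the sample string are assumptions made for illustration; they are not part of the answer.

```python
import re

def longest_run(pattern, text):
    """Return the longest number of back-to-back repeats of pattern in text."""
    best = 0
    freq = 0
    prev_end = None  # end offset of the previous match, if any
    for m in re.finditer(re.escape(pattern), text):
        start, end = m.span()
        if start != prev_end:
            # Gap before this match: the previous run is over.
            freq = 0
        freq += 1
        best = max(best, freq)
        prev_end = end
    return best

# Hand-made sample: AGATC repeated 3 times, then (after a gap) 6 times.
sample = "TT" + "AGATC" * 3 + "GG" + "AGATC" * 6 + "CC"
print(longest_run("AGATC", sample))  # 6
print(longest_run("TCTAG", sample))  # 0
```

This matches the example from the question: a run of 3 followed later by a run of 6 should report 6.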

Venkatesh-Prasad Ranganath


Omg thank you sooo much! This worked perfectly, the only thing i had to add was to make a list out of the values in the dictionary and now the program works perfectly. Thaank youu again! Have a great day :) edit: Only made a list out of the values for another use in the code – Mango88 Jan 19 '20 at 19:54
@Mango88 Glad it helped. If a solution works for you, then accept/up-vote the answer that provided the solution to indicate the answer/solution is valid. This helps others having similar questions/looking for similar solutions. – Venkatesh-Prasad Ranganath Jan 19 '20 at 20:26
I suggest having two dictionaries, one for the patterns, one for the counts, both indexed by strings; that will generally be more useful than indexing a dictionary by regex patterns, since you can't easily look up a value by key in that case. – kaya3 Jan 19 '20 at 20:53
@Venkatesh-PrasadRanganath Ahh yess! But i am not sure if the upvote displays since i have less than 15 rep :7 – Mango88 Jan 20 '20 at 16:49
Interesting, I didn't realize that :) Out of curiosity, what about accepting an answer? – Venkatesh-Prasad Ranganath Jan 20 '20 at 17:32
Oh yes that worked. didnt even realize there was such thing until now. Just did it :=) – Mango88 Jan 21 '20 at 16:53
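
The suggestion in the comments to keep two dictionaries, both keyed by the plain pattern string, could look roughly like the sketch below. The names `names`, `regexes`, `counts` and `update_counts`, and the two sample rows, are assumptions chosen for illustration.

```python
import re

names = ["AGATC", "TCTAG"]  # illustrative subset of the patterns

# Both dictionaries are keyed by the plain pattern string.
regexes = {name: re.compile(re.escape(name)) for name in names}
counts = {name: 0 for name in names}

def update_counts(row):
    """Update counts with the longest consecutive run of each pattern in row."""
    for name, regex in regexes.items():
        prev_end = None
        freq = 0
        for m in regex.finditer(row):
            if m.start() != prev_end:
                freq = 0  # a gap ends the previous run
            freq += 1
            counts[name] = max(counts[name], freq)
            prev_end = m.end()

for row in ["AGATCAGATCTT", "TCTAGGAGATC"]:
    update_counts(row)

print(counts["AGATC"])  # looked up directly by string: 2
```

With string keys, a result can be read back as `counts["AGATC"]` without holding on to the compiled regex object.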

alissongranemann · Answer 2 · 2020-01-20T10:13:26.880

0

Here's my solution:

import re

patterns = {"AGATC": 0, "TTTTTTCT": 0, "AATG": 0, "TCTAG": 0, ...}

with open(sequence, 'rt') as file:
    rows = file.readlines()

    for row in rows:
        for pattern in patterns:
            regex = r"({0}(?:{0})+)".format(pattern)  # two or more consecutive repeats
            results = re.findall(regex, row)  # list of consecutive runs found in this row
            if results:
                longest_sequence = max(results, key=len)  # longest run in this row
                count = len(longest_sequence) // len(pattern)  # number of occurrences in that run
                patterns[pattern] = max(count, patterns[pattern])  # keep the best across rows

An example of the regex would be (AGATC(?:AGATC)+), meaning: find AGATC followed by AGATC one or more times (+). The ?: marks a non-capturing group, so that findall returns only one group: the entire match.
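
To see what that regex returns in practice, here is a small sketch on a hand-made row. The sample string is an assumption for illustration only.

```python
import re

pattern = "AGATC"
regex = r"({0}(?:{0})+)".format(pattern)  # expands to (AGATC(?:AGATC)+)

# Hand-made row: a run of 2, a gap, then a run of 4.
row = "TT" + "AGATC" * 2 + "GG" + "AGATC" * 4 + "C"

runs = re.findall(regex, row)  # one string per consecutive run
print(runs)

longest = max(runs, key=len)
print(len(longest) // len(pattern))  # 4
```

Each element of `runs` is a whole run, so dividing its length by the pattern length gives the number of repeats in that run.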

edited Jan 20 '20 at 10:13

answered Jan 19 '20 at 20:28

alissongranemann


Shouldn't `+` be `*` to account for the case when a pattern occurs in isolation? – Venkatesh-Prasad Ranganath Jan 20 '20 at 00:25
No, because it needs at least another sequence to be consecutive, with 2 sequences in a row. – alissongranemann Jan 20 '20 at 00:48
The regex `r'(a(?:a)+)'` would report no match for the string `'a'`. So, if we are looking for isolated occurrences as well as consecutive occurrences, then `*` is needed. I suspect this is the case here; maybe I am wrong. The OP can clarify. – Venkatesh-Prasad Ranganath Jan 20 '20 at 00:59
Shouldn't the last line be `patterns[pattern] = max(int(count), patterns[pattern])`? – Venkatesh-Prasad Ranganath Jan 20 '20 at 01:00
If that's the case, I agree. But what I got of his problem is that he doesn't want to take into account isolated occurrences (maybe I'm wrong too). – alissongranemann Jan 20 '20 at 01:32
I'm already getting the longest sequence, so max would not be necessary. – alissongranemann Jan 20 '20 at 01:33
`max` is needed to get the longest sequence across all rows. Without `max`, 4 occurrences in row 3 will be reported instead of the 40 occurrences in row 2. – Venkatesh-Prasad Ranganath Jan 20 '20 at 01:47
Oh I see, you're right. I edited my answer, thanks! – alissongranemann Jan 20 '20 at 10:14
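
The `+` versus `*` point from the comments above can be checked with a short sketch. The variable names and the sample row are assumptions for illustration; which behaviour is wanted depends on whether an isolated occurrence should count as a run of 1.

```python
import re

pattern = "AGATC"
two_or_more = re.compile(r"({0}(?:{0})+)".format(pattern))  # needs at least 2 repeats
one_or_more = re.compile(r"({0}(?:{0})*)".format(pattern))  # a single repeat also matches

row = "TTAGATCGG"  # a single, isolated AGATC

print(two_or_more.findall(row))  # []
print(one_or_more.findall(row))  # ['AGATC']
```

With `+`, an isolated occurrence leaves the count at 0; with `*`, it is reported as a run of length 1.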
