I can't figure out how to make my code count the maximum number of times a pattern appears in a row in a string. I have tried Google and other sources, but none of the answers really match what I am looking for. Perhaps I am just searching for the wrong thing. Anyway, here is my issue:
I have a long text file that contains random DNA sequences. I have read it into a string, and in that string I am supposed to find several specific DNA sequences and count the highest number of times each one appears consecutively. To better explain the problem, here is the code I am currently trying to use.
import re

# Open sequence.txt and read it into a single string without newlines
seqfile = open(sequence, "r")
seqfile = seqfile.read().replace("\n", "")
# Regex for each STR
pattern1 = r"AGATC"
pattern2 = r"TTTTTTCT"
pattern3 = r"AATG"
pattern4 = r"TCTAG"
pattern5 = r"GATA"
pattern6 = r"TATC"
pattern7 = r"GAAA"
pattern8 = r"TCTG"
# 3 lists to store value for the loop. Whereas outercount is the final value of each amount of STR corresponding data list
outercount = [0, 0, 0, 0, 0, 0, 0, 0]
innercount = [0, 0, 0, 0, 0, 0, 0, 0]
secondcount = [0, 0, 0, 0, 0, 0, 0, 0]
# Looping through the sequence and checking if pattern matches, if it does update secondcounter by 1 and continue...
for i in seqfile:
    if re.match(pattern1, seqfile):
        secondcount[0] += 1
    elif re.match(pattern2, seqfile):
        secondcount[1] += 1
    elif re.match(pattern3, seqfile):
        secondcount[2] += 1
    elif re.match(pattern4, seqfile):
        secondcount[3] += 1
    elif re.match(pattern5, seqfile):
        secondcount[4] += 1
    elif re.match(pattern6, seqfile):
        secondcount[5] += 1
    elif re.match(pattern7, seqfile):
        secondcount[6] += 1
    elif re.match(pattern8, seqfile):
        secondcount[7] += 1
# Looping through outercount and checking if certain value at innercount is less than secondcount update values.
for i in outercount:
    if secondcount[i] > innercount[i]:
        # stop counting
        innercount[i] = secondcount[i]
        # Reset secondcounts value so that it doesn't continue counting if it is not consecutively
        secondcount[i] = 0
    # Checking if innercount is greater than outercount, if it is set outercount[i] to equal innercount[i] value
    if innercount[i] > outercount[i]:
        outercount[i] = innercount[i]
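One likely reason the loop above never counts runs: `re.match` only tests the very start of the string it is given, and the loop passes the whole string on every iteration, so each iteration gives the same answer. A minimal sketch of the difference, using a made-up sample string (`sample` and the result names are hypothetical, not part of my real data):

```python
import re

# Hypothetical sample: two leading bases, then AGATC twice in a row
sample = "TTAGATCAGATC"

# re.match only tries to match at position 0 of the string
at_start = re.match(r"AGATC", sample)

# re.search scans forward and returns the first occurrence anywhere
anywhere = re.search(r"AGATC", sample)

# Matching against a slice tests one specific position instead
at_offset = re.match(r"AGATC", sample[2:])

print(at_start)          # None, because the string starts with "TT"
print(anywhere.start())  # 2
print(at_offset is not None)  # True
```

So to test a pattern at a given position, the position has to be passed in somehow, for example through a slice as above.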
Here is an example of how the sequence text file might look:
TGGTTTAGGGCCTATAATTGCAGGACCACTGGCCCTTGTCGAGGTGTACAGGTAGGGAGCTAAGTTCGAAACGCCCCTTGGTCGGGATTACCGCCAGATCAGATC...
Mind you, the real file contains far more text than this; this is just for reference. In this text I am supposed to find up to 8 different DNA sequences and the number of times each appears in a row. For example, look for the pattern AGATC, then count the highest number of times it appears in a row. If it first appears 3 times in a row somewhere in the text and then 6 times in a row further down, my counter for AGATC should be 6, since that is the longest run.
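The goal described above (the longest consecutive run of one pattern) can be sketched with a single regular expression. This is only a sketch under assumptions: `longest_run` is a hypothetical helper name, and the sample string is made up. The lookahead `(?=...)` makes the regex try every starting position, so a run is not missed when an earlier match consumed part of it.

```python
import re

def longest_run(sequence: str, pattern: str) -> int:
    """Return the highest number of back-to-back repeats of pattern."""
    # (?:PATTERN)+ greedily matches one or more consecutive repeats;
    # wrapping it in a lookahead tests it at every position in the string.
    regex = re.compile(f"(?=((?:{re.escape(pattern)})+))")
    runs = (len(m.group(1)) // len(pattern) for m in regex.finditer(sequence))
    return max(runs, default=0)

# Made-up sample: AGATC 3 times, a gap, then AGATC 6 times
sample = "AGATC" * 3 + "TT" + "AGATC" * 6 + "G"
print(longest_run(sample, "AGATC"))  # 6
```

A plain `re.findall` on `(?:AGATC)+` would be shorter, but non-overlapping matching can undercount for patterns such as TTTTTTCT, whose last base equals its first.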
So, to explain my code: I had the idea of using 3 different lists, which I guess is not the most scalable solution, since there can be either 3 or 8 different patterns in the text. But I thought that if I started with the largest number, it might be easier to figure out the rest afterwards. What I tried to do was make a regex for each pattern, check whether each pattern could be found in the text, and if it could, increment the corresponding index of the secondcount list.
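On the scalability concern: one possible alternative to three parallel lists is a single dictionary keyed by pattern, so the same code handles 3 or 8 patterns. The sketch below is an assumption about how that could look, not a fix of the code above; `longest_run` and `longest_runs` are hypothetical names. It relies on the fact that a run of n repeats exists exactly when `pattern * n` is a substring.

```python
def longest_run(sequence, pattern):
    """Largest n such that pattern repeated n times occurs in sequence."""
    if not pattern:
        return 0  # an empty pattern would otherwise loop forever
    count = 0
    while pattern * (count + 1) in sequence:
        count += 1
    return count

def longest_runs(sequence, patterns):
    """Map every pattern to its longest consecutive run."""
    return {pattern: longest_run(sequence, pattern) for pattern in patterns}

# Made-up sample data; any number of patterns works
sample = "AATGAATG" + "CC" + "TCTG" + "AATG"
print(longest_runs(sample, ["AATG", "TCTG", "GATA"]))
# {'AATG': 2, 'TCTG': 1, 'GATA': 0}
```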
Then, in another loop, I compare whether secondcount[i] is greater than innercount[i]. If it is, I copy the value to innercount[i] and reset secondcount[i], because presumably that marks the end of the consecutive run; if the pattern appears again later in the string, counting starts again from 0, and so on. I guess the code is not so hard to understand, but well, it doesn't work, so... XD
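The count-then-reset idea described above can be sketched without regex, and two counters per pattern appear to be enough: a running count (the role of secondcount) and a best-so-far (the role of outercount). This is a sketch under assumptions; `longest_run_scan` is a hypothetical name and the sample is made up.

```python
def longest_run_scan(sequence, pattern):
    """Scan every start position, counting back-to-back repeats from there."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    step = len(pattern)
    best = 0
    for start in range(len(sequence) - step + 1):
        current = 0  # reset the running count for each new start position
        pos = start
        # Keep jumping one pattern length ahead while the pattern repeats
        while sequence.startswith(pattern, pos):
            current += 1
            pos += step
        # Remember the longest run seen so far
        if current > best:
            best = current
    return best

sample = "AGATC" * 3 + "TT" + "AGATC" * 6
print(longest_run_scan(sample, "AGATC"))  # 6
```

Trying every start position is slower than a single pass, but it avoids missing a run that begins in the middle of an earlier partial match.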
Does anyone have some ideas on how I could implement this?