I am currently working on the CS50 problem set https://cs50.harvard.edu/x/2021/psets/6/dna/
The problem asks us to find the DNA sequences (STRs) that repeat consecutively in a txt file, and to match the length of each longest run against the people listed in a csv file.
This is the code I have so far (not complete yet):
import re, csv, sys

def main(argv):
    # Open csv file
    csv_file = open(sys.argv[1], 'r')
    str_person = csv.reader(csv_file)
    nucleotide = next(str_person)[1:]

    # Open dna sequences file
    txt_file = open(sys.argv[2], 'r')
    dna_file = txt_file.read()

    str_repeat = {}
    str_list = find_STRrepeats(str_repeat, nucleotide, dna_file)

def find_STRrepeats(str_list, nucleotide, dna):
    for STR in nucleotide:
        groups = re.findall(rf'(?:{STR})+', dna)
        if len(groups) == 0:
            str_list[STR] = 0
        else:
            str_list[STR] = groups
    print(str_list)

if __name__ == "__main__":
    main(sys.argv[1:])
Output (from the print(str_list) call):
{'AGATC': ['AGATCAGATCAGATCAGATC'], 'AATG': ['AATG'], 'TATC': ['TATCTATCTATCTATCTATC']}
But as you can see, each value in the dictionary is stored as one long string containing all the consecutive repeats. If I use the len function, as in str_list[STR] = len(groups), the result is 1 for every key in the dictionary, because each list holds only a single match.
What I actually want is to find how many times each STR is repeated in the run and store that count as the value in my dict.
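To make the problem concrete, here is a minimal reproduction. The DNA string is made up for illustration and does not come from the CS50 data files; only the regex is taken from the code above.

```python
import re

STR = "AGATC"
# Hypothetical DNA string: four consecutive repeats of STR with filler on both sides.
dna = "GG" + STR * 4 + "TT"

groups = re.findall(rf'(?:{STR})+', dna)

print(groups)       # a list with one element: the whole run as a single string
print(len(groups))  # counts runs found, not repeats inside a run
```

Since findall returns one string per run, len(groups) counts the runs rather than the repeats inside a run.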
So I want the repeats to be stored separately, something like this:
{'AGATC': ['AGATC', 'AGATC', 'AGATC', 'AGATC'], 'AATG': ['AATG'], 'TATC': ['TATC', 'TATC', 'TATC', 'TATC', 'TATC']}
What should I add to my code so that the repeats are separated by commas like that? Or is there some condition I can add to my regex, groups = re.findall(rf'(?:{STR})+', dna)?
I would rather not change the whole regex, because I think it is already useful for finding the longest run of a string that repeats consecutively. I am also proud that I got this far without help, since I am a beginner with Python. Thank you.
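For reference, a minimal sketch of one possible direction that leaves the regex untouched: every repeat has the same length, so the number of repeats in a run is the length of the run divided by the length of the STR. The helper names count_longest_run and split_run, and the sample DNA string, are hypothetical and used only for illustration.

```python
import re


def count_longest_run(STR, dna):
    """Return how many times STR repeats in its longest consecutive run in dna."""
    runs = re.findall(rf'(?:{STR})+', dna)  # same regex as in the question
    if not runs:
        return 0
    longest = max(runs, key=len)
    # Each repeat is len(STR) characters long, so integer division gives the count.
    return len(longest) // len(STR)


def split_run(STR, run):
    """Split one run such as 'AGATCAGATC' into its separate repeats."""
    return re.findall(STR, run)


# Hypothetical DNA string built to mirror the output shown above.
dna = "AGATC" * 4 + "AATG" + "TATC" * 5

counts = {STR: count_longest_run(STR, dna) for STR in ["AGATC", "AATG", "TATC"]}
print(counts)  # {'AGATC': 4, 'AATG': 1, 'TATC': 5}

print(split_run("AGATC", "AGATC" * 4))  # ['AGATC', 'AGATC', 'AGATC', 'AGATC']
```

If only the count is needed for comparing against the csv rows, count_longest_run alone may be enough; split_run is included only to show how the separated list form could be produced.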