I am currently working on this CS50 problem set: https://cs50.harvard.edu/x/2021/psets/6/dna/

The problem asks us to find, for each DNA sequence (STR), the longest run of consecutive repeats in a txt file, and then match those counts against the people listed in a csv file.

This is the code I have so far (not complete yet):

import csv
import re
import sys


def main():
    # Open the csv file and read the STR names from the header row
    # (the first column is skipped)
    with open(sys.argv[1], 'r') as csv_file:
        str_person = csv.reader(csv_file)
        nucleotide = next(str_person)[1:]

    # Open the DNA sequence file
    with open(sys.argv[2], 'r') as txt_file:
        dna_file = txt_file.read()

    str_repeat = {}
    find_STRrepeats(str_repeat, nucleotide, dna_file)


def find_STRrepeats(str_list, nucleotide, dna):
    for STR in nucleotide:
        # Each match is one maximal run of consecutive copies of STR
        groups = re.findall(rf'(?:{STR})+', dna)
        if len(groups) == 0:
            str_list[STR] = 0
        else:
            str_list[STR] = groups
    print(str_list)


if __name__ == "__main__":
    main()

Output (from the print(str_list)):

{'AGATC': ['AGATCAGATCAGATCAGATC'], 'AATG': ['AATG'], 'TATC': ['TATCTATCTATCTATCTATC']}

But as you can see, each value in the dictionary stores the repeats as one joined string. If I use the len function, as in str_list[STR] = len(groups), it gives 1 for every key in the dictionary, whereas I want to find how many times (total length) each DNA sequence is repeated and store that as the value in my dict.

So I want the repeats stored separately, like this:

{'AGATC': ['AGATC', 'AGATC', 'AGATC', 'AGATC'], 'AATG': ['AATG'], 'TATC': ['TATC', 'TATC', 'TATC', 'TATC', 'TATC']}

What should I add to my code so the repeats are separated by commas like that? Or is there some condition I can add to my regex, groups = re.findall(rf'(?:{STR})+', dna)?

I don't want to change the whole regex, because I think it is already useful for finding the longest run of consecutive repeats. And I am proud that I got this far without help, since I'm a beginner with Python. Thank you.

martineau
  • Only the PyPI regex module supports a capture stack for each group. So run `pip install regex` first, then use `import regex` and `groups = [x.captures(1) for x in regex.finditer(rf'({STR})+', dna)]` – Wiktor Stribiżew Feb 12 '21 at 09:38
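
If installing a third-party module is not an option, the separated list asked for in the question can probably also be built with the standard library alone, by splitting each matched run in a second step. This is only a minimal sketch: the helper name `split_runs` and the sample sequence are made up for illustration and are not part of the original code or the CS50 data files.

```python
import re


def split_runs(STR, dna):
    """Return one list of individual STR copies per consecutive run in dna."""
    pattern = re.escape(STR)
    # First pass: the original regex, one match per maximal run of repeats
    runs = re.findall(rf'(?:{pattern})+', dna)
    # Second pass: cut each run back into its individual copies
    return [re.findall(pattern, run) for run in runs]


# Made-up sequence for illustration only
dna = "AGATCAGATCAGATCAGATCTTAATG"
print(split_runs("AGATC", dna))
print(split_runs("AATG", dna))
```

The original regex stays untouched; `len()` of each inner list is then the number of repeats in that run.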

1 Answer


I would just store the highest number of repetitions:

    ...
    if len(groups) == 0:
        str_list[STR] = 0
    else:
        str_list[STR] = max(len(i) // len(STR) for i in groups)
    ....

BTW, this also correctly handles the case where the same STR occurs in more than one run.
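
Put together, the idea might look like the sketch below. It is an illustration under assumptions rather than a finished solution: the function name `longest_str_runs` and the test sequence are invented here, and `re.escape` plus `max(..., default=0)` are small additions that replace the explicit `if`/`else`.

```python
import re


def longest_str_runs(strs, dna):
    """Map each STR to its largest number of back-to-back repeats in dna."""
    counts = {}
    for STR in strs:
        # Each match is one maximal run of consecutive copies of STR
        groups = re.findall(rf'(?:{re.escape(STR)})+', dna)
        # Run length divided by STR length is that run's repeat count;
        # default=0 covers an STR that never occurs
        counts[STR] = max((len(g) // len(STR) for g in groups), default=0)
    return counts


# Made-up sequence: AGATC appears in two runs (4 and 2 copies)
dna = "GG" + "AGATC" * 4 + "TT" + "AATG" + "CC" + "TATC" * 5 + "G" + "AGATC" * 2
print(longest_str_runs(["AGATC", "AATG", "TATC", "CCCC"], dna))
# {'AGATC': 4, 'AATG': 1, 'TATC': 5, 'CCCC': 0}
```

Returning the dictionary instead of printing it makes the counts easy to compare against each row of the csv file later.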

Serge Ballesta
  • But can you explain to me how `max(len(i) // len(STR) for i in groups)` actually works? Especially what `max` does? Thanks. –  Feb 12 '21 at 11:16
  • `max` returns the maximum value of an iterable, and `(len(i) // len(STR) for i in groups)` is a generator (an iterable) giving the number of repetitions per *group* (the size of the group divided by the size of the pattern is, of course, the number of repetitions) – Serge Ballesta Feb 12 '21 at 12:39
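
To make that explanation concrete, here is a tiny worked example. The `groups` list is invented for illustration; it stands for what `re.findall` might return when the same STR occurs in three separate runs.

```python
STR = 'AGATC'
# Hypothetical matches: three runs of 2, 4 and 1 copies of STR
groups = ['AGATCAGATC', 'AGATCAGATCAGATCAGATC', 'AGATC']

# Length of each run divided by the length of one copy = repeats in that run
repeats_per_run = [len(run) // len(STR) for run in groups]
print(repeats_per_run)  # [2, 4, 1]

# The generator form computes the same values lazily; max keeps the largest
longest = max(len(run) // len(STR) for run in groups)
print(longest)  # 4
```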