How to iterate over sections of a string while also analyzing the up and downstream flanking regions?

Question

I would like to make a sliding window in python that examines a sequence of DNA (that is going to be anywhere between 2000 to 4000 base pairs long) in frames of 120 base pairs. However, I also want to take into account about 20 the nucleotides flanking the up and downstream regions of the 120 base pair frame. However, if the sliding window is moves to position 14 or position 1992 in a 2000 base pair long DNA sequence, for example, then obviously either the upstream or downstream flanking regions are going to have to be less than 20 base pairs long.

So far, I have designed my code like so:

import from Bio import SeqIO
from Bio.Alphabet.IUPAC import IUPACUnambiguousDNA

fasta= SeqIO.to_dict(SeqIO.parse("RD4.fasta", "fasta", alphabet=IUPACUnambiguousDNA()))

sequence= DNA_sequence.values()[0].seq
print(sequence)
sequence= "TGTGAATTCATACAAGCCGTAGTCGTGCAGAAGCGCAACACTCTTGGAGTGGCCTACAACGGCGCTCTCCGCGGCGCGGGCGTACCGGATATCTTAGCTGGTCAATAGCCATTTTTCAGCAATTTCTCAGTAACGCTACGGG"


target_length= 120 
for position in range(len(sequence)-target_length+1):
    stop= position+target_length
    potential_target_frame= sequence[position:stop]
    potential_target_frame= str(potential_target)
    if position < 20:
        upstream_flank= sequence[:position]
        downstream_flank= sequence[stop:stop+20]
    elif len(sequence) - stop < 20:
        upstream_flank= sequence[position-20:position]
        downstream_flank= sequence[stop:]
    else:
        upstream_flank= sequence[position-20:position]
        downstream_flank= sequence[stop:stop+20]
    print("upstream flank is " + upstream_flank)
    print("downstream flank is " + downstream_flank)

While this code is ostensibly designed logically, the print functions show that there is a problem for how this code is designed– only the downstream flank is printed, not the upstream flank.

Is there a problem with how my conditional tree has been set up, or is the problem with how I am slicing my original sequence?

_"only the downstream flank is printed, not the upstream flank."_ Please clarify. Do you mean "it prints only `downstream flank is` followed by the downstream flank, and nothing else appears at all", or do you mean "it prints `upstream flank is`, followed by an empty list, followed by `downstream flank is`, followed by a non-empty list"? — Kevin, May 15 '17 at 14:03
You should be using the 3rd parameter of `range()`, the step argument — Chris_Rands, May 15 '17 at 14:08
@Kevin To clarify, I meant that only `downstream flank is ...` (where the ... represents the sequence at hand). The string `upstream flank is` is not printed, nor is its corresponding sequence at hand. — Bob McBobson, May 15 '17 at 14:18
@Chris_Rands how would the 3rd parameter (step argument) of the 'range()' argument help me in this case? I thought that the step argument only indicated the number of letters (or values) one wanted to skip over in a loop. — Bob McBobson, May 15 '17 at 14:21
Have you considered writing your code in terms of a 'sliding window' to hide some of its complexity? (*See* http://stackoverflow.com/questions/6822725/rolling-or-sliding-window-iterator-in-python.) — Bill Bell, May 15 '17 at 14:35
@BillBell well, technically I am implementing a sliding window right now with my code `range(len(sequence)-target_length+1)`. Do you think I would need to use another sliding window within the code again? — Bob McBobson, May 15 '17 at 14:48
"The string upstream flank is is not printed". That's very interesting behavior. Can you provide the contents of `sequence` and `potential_target` so I can run this code on my own machine and investigate its output? I tried a simple `sequence = "GCAT"*1000` as placeholder data, but the upstream flank prints just fine for that, so I'm guessing the problem requires your specific data. — Kevin, May 15 '17 at 14:59
It doesn't seem so. The window is 120+2*20 pairs long, if I understand correctly. You need to take the middle 120-pair slice from this, and 'consider' the 20-pair slices at either end. — Bill Bell, May 15 '17 at 15:00
@Kevin yes. I will provide an example of what the contents of sequence would be by editing my code above. — Bob McBobson, May 15 '17 at 15:01
@BillBell I should have mentioned before, the above code is a somewhat condensed version of my original code. I have a lot of other processes occurring with the original 120 base pair frame, so I would prefer to keep it in its original form. That being said, perhaps enhancing the frame to 160 base pairs would make this goal more practical, so I'll see if I can play around to make things work. — Bob McBobson, May 15 '17 at 15:27
@Kevin I have tried playing around with my data to see why the `upstream_flank` variable is not being generated. I still cannot figure out why this is the case. — Bob McBobson, May 15 '17 at 18:24
Strange, when I run that code (with a couple lines deleted because I don't have `Bio` installed), I can [see `upstream_flank` in the output just fine](http://ideone.com/OAIQjx). — Kevin, May 15 '17 at 18:28
@Kevin for some reason, my upstream flanks will only be printed when I have the line `print("upstream flank is " + upstream_flank)` outside the conditional tree. I have no idea why this is. I'm going to keep playing with the code to see why this is. I have confirmed that the `sequence` variable is only a string, so why this is occurring makes no sense. — Bob McBobson, May 16 '17 at 14:05

score 0 · Accepted Answer · answered May 18 '17 at 10:52

It turns out that I had set up the conditional tree incorrectly. Because I was dealing with two different sections of the string, and because those two sections could exist in three distinct states (either having a length greater than 20, less than 20, or equal to zero), there had to be 3^2 parts of my conditional tree. In the case where an up or downstream flank has a length of zero, I set its variable to being an empty string.

So here is how the code should have been set up (I have condensed it slightly from the code I set up above, as well as changed the way in which the up and downstream sections are calculated):

target_length= 120 
for position in range(len(sequence)-target_length+1):
    stop= position+target_length
    potential_target_frame= sequence[position:stop]
    potential_target_frame= str(potential_target)
    if len(sequence[:pos]) == 0 and len(sequence[stop:]) > 20:
        upstream_flank= " "
        downstream_flank= sequence[stop:stop+20]
        print("upstream flank is " + upstream_flank)
        print("downstream flank is " + downstream_flank)
    elif (len(sequence[:pos]) >0 and <20) and (len(sequence[stop:]) >20:
        upstream_flank= sequence[:position]
        downstream_flank= sequence[stop:stop+20]
        print("upstream flank is " + upstream_flank)
        print("downstream flank is " + downstream_flank)
    ############ 
    #####Just assume the other 5 out of 8 scenarios will be written out in elif conditions in this hash section
    ############
    else:
        upstream_flank= sequence[position-20:position]
        downstream_flank= sequence[stop:stop+20]
        print("upstream flank is " + upstream_flank)
        print("downstream flank is " + downstream_flank)

How to iterate over sections of a string while also analyzing the up and downstream flanking regions?

1 Answers1