I'm trying to build a dictionary inside a for loop that pairs each bacterium's description with its DNA sequence. The problem is that my variables cannot hold more than one value: each pass through the loop overwrites the previous one, so my dictionary ends up with only a single entry.

#reads a FASTA file and separates the descriptions and DNA sequences
for line in resistance_read:
    if line.startswith(">"):
        description = line
    else: 
        sequence = line

#trying to get the output from the for loop and into the dictionary
bacteria_dict = {description:sequence}

Output:

line3description
dna3sequence

However, with the code below, I am able to get all of the outputs:

for line in resistance_read:
    if line.startswith(">"):
       print line
    else: 
       print line

Output:

line1description
line2description
line3description
dna1sequence
dna2sequence
dna3sequence
David
  • That's not how variables work in Python (and indeed in most languages). See http://en.wikibooks.org/wiki/Python_Programming/Variables_and_Strings – Cuadue Mar 03 '15 at 22:13
  • How does the incoming file look? How do you know which description lines up with which sequence? – MasterOdin Mar 03 '15 at 22:17
  • Well, my goal is for the for loop to generate multiple outputs. However, I don't know how to capture all of them, and if I assign the outputs to a variable, it is overwritten every time the loop runs. In Python, I believe variables can be reassigned; they just end up holding the latest value. – David Mar 03 '15 at 22:29

2 Answers

2

You're constantly overwriting the values of variables in your iterations. sequence and description only hold the last values when the iteration completes.

Instead, create the dictionary first and add to it inside the loop. A dictionary is a more complex data structure than a single variable, so it can hold every pair rather than just the last one.
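
As a minimal sketch of that idea, assuming each description line (starting with ">") is immediately followed by its sequence line: the sample list below is made up and only stands in for the lines of the file, and descriptions are used as keys to match the {description: sequence} form in the question.

```python
# Hypothetical sample data standing in for the lines read from the FASTA file
resistance_read = [
    ">line1description\n",
    "dna1sequence\n",
    ">line2description\n",
    "dna2sequence\n",
]

bacteria_dict = {}      # create the dictionary once, before the loop
description = None      # most recent description line seen so far
for line in resistance_read:
    line = line.strip()
    if line.startswith(">"):
        description = line
    elif description is not None:
        # add one entry per sequence line instead of overwriting a variable
        bacteria_dict[description] = line
```

Because the assignment into bacteria_dict happens inside the loop, every pair is kept. If the file lists all descriptions first and all sequences afterwards, this pairing would not work as written.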


However, there is an easier way...

First you need to open the file and read the lines. To do that you can use the with context manager:

with open('file_path', 'r') as f:
    # used strip() to remove '\n'
    lines = [line.strip() for line in f]

Now that all the lines are in a list called lines, you want to create a mapping between descriptions and sequences.

If each description line sits directly above its sequence line, use this slicing:

# descriptions: every other line (step of 2) starting from index 0
descriptions = lines[0::2]
# sequences: every other line (step of 2) starting from index 1
sequences = lines[1::2]

Now use zip to zip them together and create a mapping from each pair:

result = dict(zip(descriptions, sequences))

If it's the other way around (each sequence line comes before its description), swap the two slices:

result = dict(zip(lines[1::2], lines[0::2]))
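
To make the slicing concrete, here is a small sketch on made-up data (the names lines_desc_first and lines_seq_first are illustrative only). It assumes the lines strictly alternate and have already been stripped.

```python
# Hypothetical stripped lines: each description directly above its sequence
lines_desc_first = [">desc1", "AAAA", ">desc2", "CCCC", ">desc3", "GGGG"]

descriptions = lines_desc_first[0::2]   # indexes 0, 2, 4
sequences = lines_desc_first[1::2]      # indexes 1, 3, 5
result = dict(zip(descriptions, sequences))

# Hypothetical reversed layout: each sequence directly above its description
lines_seq_first = ["AAAA", ">desc1", "CCCC", ">desc2", "GGGG", ">desc3"]
result_reversed = dict(zip(lines_seq_first[1::2], lines_seq_first[0::2]))
```

Both dictionaries end up mapping each description to its sequence; only the slice offsets differ.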

Edit:

Following your update, it seems the way to do it, assuming there is exactly one description for each sequence, is to split the list of lines in half and then zip the two halves:

# integer division keeps mid usable as a slice index
mid = len(lines) // 2
result = dict(zip(lines[:mid], lines[mid:]))
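
A quick worked example of the half-split on made-up data, assuming all descriptions come first, in the same order as their sequences:

```python
# Hypothetical stripped lines: a block of descriptions, then a block of sequences
lines = [">desc1", ">desc2", ">desc3", "AAAA", "CCCC", "GGGG"]

mid = len(lines) // 2                  # 3; floor division gives an int
first_half = lines[:mid]               # the descriptions
second_half = lines[mid:]              # the sequences
result = dict(zip(first_half, second_half))
```

If the two blocks have different lengths, zip silently stops at the shorter one, so it may be worth checking that len(lines) is even first.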
Reut Sharabani
  • I'm confused as to why we need to remove '\n'. Wouldn't that erase the lines? – David Mar 03 '15 at 23:58
  • `\n` is used to mark a new (visual) line in a string. Once you have the line as a member in a list - you don't need it anymore. – Reut Sharabani Mar 04 '15 at 00:00
  • I feel like I'm so close to grasping this, but I'm still confused about what 'lines' does. – David Mar 04 '15 at 00:24
  • how about printing it? use a small file for testing first. – Reut Sharabani Mar 04 '15 at 00:25
  • Ohh okay, I see that you are creating a list but why does stripping \n allow me to store each string in? I thought with \n, each entire element is a string and we can store it that way. Without \n, wouldn't it just be one big line? I ran it and I see that removing \n does let me create a list but I still don't get it. – David Mar 04 '15 at 00:30
  • To test the effect of a certain part, try running the program without it, or reading up what it does. You'll learn much more by understanding each part of your code. Good luck. – Reut Sharabani Mar 04 '15 at 00:32
  • Ohhh okay now I get it! I never learned about creating lists this way haha. So correct me if I'm wrong but, by stripping \n, it serves as a placeholder of where to create an element each time, thus which is why there is a comma after every \n. Thank you so much!! :) – David Mar 04 '15 at 00:44
0

Based on the examples you're showing us, it looks like your file format is N lines of description followed by N lines of DNA sequence. This answer assumes that each description or DNA sequence is one line, and that there are as many sequences as there are descriptions.

If you can comfortably fit everything in memory, then the easiest way I can think of is to start as Reut Sharabani suggests above:

with open('file_path', 'r') as f:
    # used strip() to remove '\n'
    lines = [line.strip() for line in f]

Once you have lines, it's easy to create two lists, zip them up, and create a dict:

descriptions = [line for line in lines if line.startswith('>')]
sequences = [line for line in lines if not line.startswith('>')]
result = dict(zip(sequences, descriptions))

However, if the file is very large, and you don't want to do the equivalent of reading its entire length four times, you could process it only once by storing the descriptions, and matching them up with the sequences as the sequences appear.

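result = {}
descriptions = []
with open('file_path', 'r') as f:

    line = f.readline().strip()

    # collect the leading block of description lines
    while line.startswith('>'):
        descriptions.append(line)
        line = f.readline().strip()

    # line now holds the first sequence; pair it with the first description
    result[line] = descriptions.pop(0)
    for line in f:
        # strip the newline so every key has the same form as the first one
        result[line.strip()] = descriptions.pop(0)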

Of course this runs into trouble if:

  • there are not exactly the same number of sequences as descriptions
  • the sequences are in a different order than the descriptions
  • the sequences and descriptions are NOT in monolithic blocks.
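
Two of these conditions can be checked before pairing. The sketch below is one possible way to do it; pair_blocks is a hypothetical helper, not part of any library, and it cannot detect sequences that are in a different order than their descriptions.

```python
def pair_blocks(lines):
    """Map each sequence to its description, or raise ValueError.

    Assumes one line per description (starting with '>') and one line
    per sequence, with all descriptions before all sequences.
    """
    lines = [line.strip() for line in lines if line.strip()]
    flags = [line.startswith(">") for line in lines]
    n_desc = sum(flags)  # True counts as 1

    # Monolithic blocks: every description precedes every sequence
    if flags != [True] * n_desc + [False] * (len(lines) - n_desc):
        raise ValueError("descriptions and sequences are interleaved")
    # Same number of descriptions and sequences
    if n_desc * 2 != len(lines):
        raise ValueError("number of descriptions and sequences differ")

    return dict(zip(lines[n_desc:], lines[:n_desc]))
```

For example, pair_blocks(open('file_path')) would either return the dictionary or fail loudly instead of silently mispairing lines.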
pcurry