learning to parse a fasta file with python

Question

I am learning python and I want to parse a fasta file without using BioPython. My txt file looks like:

>22567
CGTGTCCAGGTCTATCTCGGAAATTTGCCGTCGTTGCATTACTGTCCAGCTCCATGCCCA
ACATTTGGCATCGGAGAATGACTCCGCGTGATAAAGTCAGAATAGGCATTGAGACTCAGG
GTGGTACCTATTA
>34454
AAAACTGTGCAGCCGGTAACAGGCCGCGATGCTGTACTATATGTGTTTGGTACATATCCG
ATTCAGGTATGTCAGGGAGCCAGCACCGGAGGATCCAGAAGTAAGTCGGGTTGACTACTC
CTAGCCTCGTTTCACCATCCGCCGGATAACTCTCCCTTCCATCATCAACTCCTCCCTTTC
GTGTCCAATGGGGCGGCGTGTCTAAGCACTGCCATATAGCTACCGAAAGGCGGCGACCCC
TCGGA

I would like to parse this to save the headers of each sequence, which are >22567 and >34454 into a headers list (this is working). And after each header read following sequence into a sequences list.

The output, I would like to look like:

headers =  ['>22567','>34454']
sequences = ['CGTGTCCAGGTCTATCTCGGAAATT...', AAAACTTTGTGAAAA....']

The problem I have is when I try to read the sequences part, I can't figure out how to concatenate each line into one sequence string before appending it into a list. Instead what I have is each line appending to the sequence list.

The code I have so far is:

#!/usr/bin/python 

import re 

dna = []
sequences = []


def read_fasta(filename):
    global seq, header, dna, sequences 

#open the file  
    with open(filename) as file:    
        seq = ''        
        #forloop through the lines
        for line in file: 
            header = re.search(r'^>\w+', line)
            #if line contains the header '>' then append it to the dna list 
            if header:
                line = line.rstrip("\n")
                dna.append(line)            
            # in the else statement is where I have problems, what I would like is
            #else: 
                #the proceeding lines before the next '>' is the sequence for each header,
                #concatenate these lines into one string and append to the sequences list 
            else:               
                seq = line.replace('\n', '')  
                sequences.append(seq)      

filename = 'gc.txt'

read_fasta(filename)

Is it an option for you to use a library like [pyfasta](https://pypi.python.org/pypi/pyfasta/) instead or do you need/want to stick to custom code? — cyroxx, Apr 22 '15 at 18:31
well I'm doing to learn to parse files better, so it would be nice to learn with custom code — DJF, Apr 22 '15 at 18:45

score 6 · Answer 1 · edited May 23 '17 at 11:58

6

Note: I had this solution on one of my projects, so I directly pasted it here. However the solution is not mine and belongs to this poster here. Please upvote his/her answer. Thanks to @donkeykong for finding the original post

Use a list to accumulate lines until you reach a new id. Then join the lines together and store them with the id in a dictionary. The following function takes an open file and yields each pair of (id, sequence).

def read_fasta(fp):
        name, seq = None, []
        for line in fp:
            line = line.rstrip()
            if line.startswith(">"):
                if name: yield (name, ''.join(seq))
                name, seq = line, []
            else:
                seq.append(line)
        if name: yield (name, ''.join(seq))

with open('ex.fasta') as fp:
    for name, seq in read_fasta(fp):
        print(name, seq)

Output:

('>22567', 'CGTGTCCAGGTCTATCTCGGAAATTTGCCGTCGTTGCATTACTGTCCAGCTCCATGCCCAACATTTGGCATCGGAGAATGACTCCGCGTGATAAAGTCAGAATAGGCATTGAGACTCAGGGTGGTACCTATTA')
('>34454', 'AAAACTGTGCAGCCGGTAACAGGCCGCGATGCTGTACTATATGTGTTTGGTACATATCCGATTCAGGTATGTCAGGGAGCCAGCACCGGAGGATCCAGAAGTAAGTCGGGTTGACTACTCCTAGCCTCGTTTCACCATCCGCCGGATAACTCTCCCTTCCATCATCAACTCCTCCCTTTCGTGTCCAATGGGGCGGCGTGTCTAAGCACTGCCATATAGCTACCGAAAGGCGGCGACCCCTCGGA')

This was an answer on SO. I'll try and find it and give the original poster the credit.

edited May 23 '17 at 11:58

Community

1
1

answered Apr 22 '15 at 18:23

letsc

2,515
5
35
54

3

[Here's](https://stackoverflow.com/questions/7654971/parsing-a-fasta-file-using-a-generator-python) the original – miradulo Apr 22 '15 at 18:47
Ok now do I delete my answer and post the link to the original answer? I dont know what to do .... – letsc Apr 22 '15 at 18:48
@letsc I flagged it as a duplicate, but your answer is just fine as is, unless you want to more explicitly give the OP credit :) – miradulo Apr 22 '15 at 18:52
Ill let this answer stay, ill add the link to the original answer to this solution. Cant take credit for what is not mine – letsc Apr 22 '15 at 18:54
Could I ask one more question! This is almost how I need it but I need a separate list for the name and for the seq. I'm trying but I still can't figure out how to do this. – DJF Apr 22 '15 at 23:00
@djfreeze - I would suggest a dictionary for that. You can extract the name and seq from the dictionary into separate lists AND get corresponding seq , given a name – letsc Apr 22 '15 at 23:03

score 2 · Answer 2 · answered Apr 22 '15 at 18:26

2

Keep a temporary string, and append to this string each time you have a part sequence. When you get a new header, append the string to the sequence list, and set it to the empty string.

answered Apr 22 '15 at 18:26

Espen Haugsdal

136
7

Padraic Cunningham · Answer 3 · 2015-04-22T19:16:49.753

You could use groupby:

from itertools import groupby
with open("in.fasta") as f:
    groups = groupby(f, key=lambda x: not x.startswith(">"))
    d = {}
    for k,v in groups:
        if not k:
            key, val = list(v)[0].rstrip(), "".join(map(str.rstrip,next(groups)[1],""))
            d[key] = val

print(d)
{'>34454': 'AAAACTGTGCAGCCGGTAACAGGCCGCGATGCTGTACTATATGTGTTTGGTACATATCCGATTCAGGTATGTCAGGGAGCCAGCACCGGAGGATCCAGAAGTAAGTCGGGTTGACTACTCCTAGCCTCGTTTCACCATCCGCCGGATAACTCTCCCTTCCATCATCAACTCCTCCCTTTCGTGTCCAATGGGGCGGCGTGTCTAAGCACTGCCATATAGCTACCGAAAGGCGGCGACCCCTCGGA', '>22567': 'CGTGTCCAGGTCTATCTCGGAAATTTGCCGTCGTTGCATTACTGTCCAGCTCCATGCCCAACATTTGGCATCGGAGAATGACTCCGCGTGATAAAGTCAGAATAGGCATTGAGACTCAGGGTGGTACCTATTA'}

Or using a generator:

def grouped(fle):
    from itertools import groupby
    with open(fle) as f:
        groups = groupby(f, key=lambda x: not x.startswith(">"))
        for k, v in groups:
            if not k:
                yield list(v)[0].rstrip(), "".join(map(str.rstrip, next(groups)[1],""))


print(list(grouped("in.fasta")))
[('>22567', 'CGTGTCCAGGTCTATCTCGGAAATTTGCCGTCGTTGCATTACTGTCCAGCTCCATGCCCAACATTTGGCATCGGAGAATGACTCCGCGTGATAAAGTCAGAATAGGCATTGAGACTCAGGGTGGTACCTATTA'), ('>34454', 'AAAACTGTGCAGCCGGTAACAGGCCGCGATGCTGTACTATATGTGTTTGGTACATATCCGATTCAGGTATGTCAGGGAGCCAGCACCGGAGGATCCAGAAGTAAGTCGGGTTGACTACTCCTAGCCTCGTTTCACCATCCGCCGGATAACTCTCCCTTCCATCATCAACTCCTCCCTTTCGTGTCCAATGGGGCGGCGTGTCTAAGCACTGCCATATAGCTACCGAAAGGCGGCGACCCCTCGGA')]

learning to parse a fasta file with python

3 Answers3