I am learning Python and want to parse a FASTA file without using Biopython. My text file looks like this:
>22567
CGTGTCCAGGTCTATCTCGGAAATTTGCCGTCGTTGCATTACTGTCCAGCTCCATGCCCA
ACATTTGGCATCGGAGAATGACTCCGCGTGATAAAGTCAGAATAGGCATTGAGACTCAGG
GTGGTACCTATTA
>34454
AAAACTGTGCAGCCGGTAACAGGCCGCGATGCTGTACTATATGTGTTTGGTACATATCCG
ATTCAGGTATGTCAGGGAGCCAGCACCGGAGGATCCAGAAGTAAGTCGGGTTGACTACTC
CTAGCCTCGTTTCACCATCCGCCGGATAACTCTCCCTTCCATCATCAACTCCTCCCTTTC
GTGTCCAATGGGGCGGCGTGTCTAAGCACTGCCATATAGCTACCGAAAGGCGGCGACCCC
TCGGA
I would like to save the header of each sequence (here >22567 and >34454) into a headers list; this part is working. Then, after each header, I would like to read the sequence that follows into a sequences list.
I would like the output to look like this:
headers = ['>22567', '>34454']
sequences = ['CGTGTCCAGGTCTATCTCGGAAATT...', 'AAAACTGTGCAGCCGGTAACAGGCC...']
The problem is in reading the sequences: I can't figure out how to concatenate the lines of one record into a single string before appending it to the list. Instead, each line is appended to the sequences list as a separate element.
The code I have so far is:
#!/usr/bin/python
import re

dna = []
sequences = []

def read_fasta(filename):
    global seq, header, dna, sequences
    # open the file
    with open(filename) as file:
        seq = ''
        # loop through the lines
        for line in file:
            header = re.search(r'^>\w+', line)
            # if the line is a header (starts with '>'), append it to the dna list
            if header:
                line = line.rstrip("\n")
                dna.append(line)
            # the else branch is where I have problems: the lines that follow,
            # up to the next '>', are the sequence for this header, and I would
            # like to concatenate them into one string and append that string
            # to the sequences list
            else:
                seq = line.replace('\n', '')
                sequences.append(seq)

filename = 'gc.txt'
read_fasta(filename)
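For reference, here is a minimal sketch of the behaviour I am after. It is one possible approach, not necessarily the best one. It assumes every record starts with a line beginning with '>', that blank lines can be skipped, and that any sequence lines appearing before the first header can be ignored. The names `chunks` and `handle` are only illustrative. The idea is to collect the lines of the current record in a temporary list, then join and store them when the next header (or the end of the file) is reached.

```python
def read_fasta(filename):
    """Return (headers, sequences) parsed from a FASTA-formatted text file."""
    headers = []
    sequences = []
    chunks = []  # sequence lines of the record currently being read

    with open(filename) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # assumption: blank lines carry no data
            if line.startswith('>'):
                # A new header closes the previous record, if there was one.
                if headers:
                    sequences.append(''.join(chunks))
                headers.append(line)
                chunks = []
            elif headers:
                # Sequence line belonging to the most recent header.
                # Lines before the first header are ignored (assumption).
                chunks.append(line)

    # The last record has no header after it, so store it here.
    if headers:
        sequences.append(''.join(chunks))

    return headers, sequences
```

Calling `headers, sequences = read_fasta('gc.txt')` on the file shown earlier should give two headers and two sequences, each sequence being a single string with no newlines. Returning the lists instead of using globals also means the function can be called on several files without mixing their results.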