I am trying to extract information from a multi-fasta file (e.g. C/G/A/T count, CG%) using biopython. I keep running into trouble when I try to iterate over the file for each fasta sequence - I can only get the first one to print out.
I suspect it may be something to do with my file format as it is not an actual fasta file, but I don't know how I can change this.
input_file = open("inputfile.fa", 'r')
output_file = open('nucleotide_counts.txt','w')
output_file.write('Gene\tA\tC\tG\tT\tLength\tCG%\n')
#count nucleotides in this record..gene_name = cur_record.name
from Bio import SeqIO
for cur_record in SeqIO.parse(input_file, "fasta"):
gene_name = cur_record.name
A_count = cur_record.seq.count('A')
C_count = cur_record.seq.count('C')
G_count = cur_record.seq.count('G')
T_count = cur_record.seq.count('T')
length = len(cur_record.seq)
cg_percentage = (float(C_count + G_count) / length)*100
output_line = '%s\t%i\t%i\t%i\t%i\t%i\t%f\n' % \
(gene_name, A_count, C_count, G_count, T_count, length, cg_percentage)
output_file.write(output_line)
output_file.close()
input_file.close()
This is what my multifasta looks like (with start and end specified)
>1:start-end
CGCCCCAGTGATGTAGCCGAA
>1:start-end
CGGCCACCCCGAAGCGTGGGG
And my output file only contains one row:
Gene A C G T Length CG%
1:start-end 85 115 180 59 439 67.198178