SeqIO.parse Biopython - which file format should I specify?

Question

I am trying to extract information from a multi-fasta file (e.g. C/G/A/T count, CG%) using biopython. I keep running into trouble when I try to iterate over the file for each fasta sequence - I can only get the first one to print out.

I suspect it may be something to do with my file format as it is not an actual fasta file, but I don't know how I can change this.

input_file = open("inputfile.fa", 'r')
output_file = open('nucleotide_counts.txt','w')
output_file.write('Gene\tA\tC\tG\tT\tLength\tCG%\n')

#count nucleotides in this record..gene_name = cur_record.name
from Bio import SeqIO
for cur_record in SeqIO.parse(input_file, "fasta"):
    gene_name = cur_record.name 
    A_count = cur_record.seq.count('A')
    C_count = cur_record.seq.count('C')
    G_count = cur_record.seq.count('G')
    T_count = cur_record.seq.count('T')
    length = len(cur_record.seq)
    cg_percentage = (float(C_count + G_count) / length)*100
    output_line = '%s\t%i\t%i\t%i\t%i\t%i\t%f\n' % \
    (gene_name, A_count, C_count, G_count, T_count, length, cg_percentage)
    output_file.write(output_line)
output_file.close()
input_file.close()

This is what my multifasta looks like (with start and end specified)

>1:start-end
CGCCCCAGTGATGTAGCCGAA
>1:start-end
CGGCCACCCCGAAGCGTGGGG

And my output file only contains one row:

Gene    A       C       G       T       Length  CG%
1:start-end 85      115     180     59      439     67.198178

At first glance, I cannot see any obvious reason why your code won't work. But I can see that your input and output does not match. In those two fasta records there are not 85 As. Could you please post sample input and expected output for that input? — Poshi, Apr 11 '19 at 13:38
I read your file had a different format, so I'll actually remove my answer. Seems to me also that that should work — cnluzon, Apr 11 '19 at 13:43
Although I haven't posted the correct input and output here I checked and it is giving the correct values for the first sequence, so I guess it must be something wrong with how I have done the looping? — egeorgia, Apr 11 '19 at 14:23
Cannot reproduce this- your current code is not the most elegant but it *does* work as it does iterate over all records as you want. Before your edit the indentation was wrong however so check this in your *actual* code — Chris_Rands, Apr 11 '19 at 14:29
You might want to include multiprocessing at the step op counting amino-acids. This a labour intense work if it become large fasta files to walk-through. Look at my answer [here](https://stackoverflow.com/a/47535397/8928024) how you can do such thing. — ZF007, Apr 12 '19 at 20:39
Your multifaste file seems to be in error. The ID "1" and ID"1" used for two different sequences will mashup your results because they are now merged into the answer instead of separate answers. I reckon this either a typo or a multifasta compile error. Please check which of the two. If its the latter there is more to check prior compiling. — ZF007, Apr 12 '19 at 20:45
@egeorgia "Thanks everyone for your help, I have figured it out now!"-- good, then write an answer! This will help future people with the same question, your question is useless to future readers without the answer — Chris_Rands, Apr 15 '19 at 11:44

score 0 · Answer 1 · answered Dec 31 '20 at 15:00


#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Dec 31 15:11:53 2020

@author: Pietro
"""
input_file = 'fasta'


output_file_name = input_file+'out'


#count nucleotides in this record..gene_name = cur_record.name
from Bio import SeqIO

output_file = open(output_file_name, 'w+')
output_file.write(('Gene\tA\tC\tG\tT\tLength\tCG%\n'))

for cur_record in SeqIO.parse(input_file, "fasta"):
    gene_name = cur_record.name 
    A_count = cur_record.seq.count('A')
    C_count = cur_record.seq.count('C')
    G_count = cur_record.seq.count('G')
    T_count = cur_record.seq.count('T')
    length = len(cur_record.seq)
    cg_percentage = (float(C_count + G_count) / length)*100
    output_line = str('%s\t%i\t%i\t%i\t%i\t%i\t%f\n' % (gene_name, A_count, C_count, G_count, T_count, length, cg_percentage))

    output_file.write(output_line)
output_file.close()

SeqIO.parse Biopython - which file format should I specify?

1 Answers1