I've been trying to write a FASTA parser that takes a FASTA text file (DNA) as input and outputs the amino acid (AA) sequence. I'm using Biopython only for parsing the input FASTA file, via the SeqIO module.

I get the output I want, but every time I run the code there is a blank line at the top of my output FASTA file, and I want to remove it.

I've searched the web, but nothing has worked for me so far.

Below is the code that I have so far.

from Bio import SeqIO
CONST_CODON = {'ttt': 'F', 'tct': 'S', 'tat': 'Y', 'tgt': 'C',
               'ttc': 'F', 'tcc': 'S', 'tac': 'Y', 'tgc': 'C',
               'tta': 'L', 'tca': 'S', 'taa': '*', 'tga': '*',
               'ttg': 'L', 'tcg': 'S', 'tag': '*', 'tgg': 'W',
               'ctt': 'L', 'cct': 'P', 'cat': 'H', 'cgt': 'R',
               'ctc': 'L', 'ccc': 'P', 'cac': 'H', 'cgc': 'R',
               'cta': 'L', 'cca': 'P', 'caa': 'Q', 'cga': 'R',
               'ctg': 'L', 'ccg': 'P', 'cag': 'Q', 'cgg': 'R',
               'att': 'I', 'act': 'T', 'aat': 'N', 'agt': 'S',
               'atc': 'I', 'acc': 'T', 'aac': 'N', 'agc': 'S',
               'ata': 'I', 'aca': 'T', 'aaa': 'K', 'aga': 'R',
               'atg': 'M', 'acg': 'T', 'aag': 'K', 'agg': 'R',
               'gtt': 'V', 'gct': 'A', 'gat': 'D', 'ggt': 'G',
               'gtc': 'V', 'gcc': 'A', 'gac': 'D', 'ggc': 'G',
               'gta': 'V', 'gca': 'A', 'gaa': 'E', 'gga': 'G',
               'gtg': 'V', 'gcg': 'A', 'gag': 'E', 'ggg': 'G'
               }

def DNA2Prot(f1, f2="translated_fasta.txt"):
    with open(f1, 'r') as fin, open(f2, 'w') as fout:
        for seq_record in SeqIO.parse(f1,'fasta'):
            sequence = seq_record.seq
            sequence = sequence.lower()
            fout.write('\n'+seq_record.description)
            fout.write('\n')
            for i in range(0,len(sequence),3):
                if sequence[i:i+3] in CONST_CODON:
                    amino_acid = CONST_CODON[str(sequence[i:i+3])]
                    fout.write(amino_acid)



if __name__ == "__main__":
    test = DNA2Prot('test_fasta.txt')
    print test

My current output looks like this.

-----------------blank space--------------
BCB2141
IG*R*SRRESLYSD
BCA2111
MA*SRVEL*GTASSCRRAVEPI*EP
BCA2112
IEPRWVWPV*SPIEPIEIESR*SLRDPRCDAD

My desired output is:

BCB2141
IG*R*SRRESLYSD
BCA2111
MA*SRVEL*GTASSCRRAVEPI*EP
BCA2112
IEPRWVWPV*SPIEPIEIESR*SLRDPRCDAD
Remi Guan
Danny
  • `fout.write('\n'+seq_record.description)` is the reason for the blank line. Skip the `\n` here for the first line. Alternatively, build up the line contents and use `str.join`. Or store all the file contents (if possible) in a variable and call `str.strip` on it before writing to the file. – mshsayem Nov 24 '15 at 04:12
  • @mshsayem, I know that, but I need the newlines to put the FASTA header and the FASTA sequence on separate lines. – Danny Nov 24 '15 at 04:13
  • Yes, that is because you put the blank line at the very start by writing '\n' before the description. Remove it and write the newline after the end of that statement instead, then see what happens; that may tell you more about how \n behaves. – Code Man Nov 24 '15 at 04:14
  • @CodeMan, when I remove it, my FASTA header and amino acid sequence end up on the same line. – Danny Nov 24 '15 at 04:15

2 Answers

You write a newline before the first header, so the file starts with a blank line. If you want a newline as a separator between records, write it at the end of each record instead:

fout.write(seq_record.description + '\n') # no more leading newline
# fout.write('\n') # moved to above
for i in range(0,len(sequence),3):
    if sequence[i:i+3] in CONST_CODON:
        amino_acid = CONST_CODON[str(sequence[i:i+3])]
        fout.write(amino_acid)
fout.write('\n')

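Note that this leaves a trailing newline at the end of the file, which may be more acceptable to you than a blank line at the top. The alternative is to treat the newline as a separator between records: either detect the last entry and skip the newline after it, or write the newline before every entry except the first.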
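A minimal sketch of that alternative, assuming the records arrive as (description, DNA string) pairs. The function name `write_translated`, the abbreviated `CODONS` table, and the sample records are hypothetical stand-ins chosen so the example runs without Biopython; in the real script the pairs would come from `SeqIO.parse` and the table would be the full `CONST_CODON`.

```python
import io

# Abbreviated codon table, for illustration only (an assumption);
# the question's full CONST_CODON would be used in practice.
CODONS = {'atg': 'M', 'gct': 'A', 'taa': '*', 'ttt': 'F'}


def write_translated(records, fout, table=CODONS):
    """Write header/protein pairs with no leading and no trailing newline."""
    for index, (description, dna) in enumerate(records):
        if index > 0:
            # Separator goes before every record except the first.
            fout.write('\n')
        dna = dna.lower()
        # Translate codon by codon, skipping unknown or partial codons.
        protein = ''.join(
            table[dna[i:i + 3]]
            for i in range(0, len(dna), 3)
            if dna[i:i + 3] in table
        )
        fout.write(description + '\n' + protein)


# Hypothetical sample records written to an in-memory file.
demo = io.StringIO()
write_translated([('BCB2141', 'ATGGCTTAA'), ('BCA2111', 'TTTATG')], demo)
output = demo.getvalue()
```

Because the separator is written before each record rather than after it, the function never needs to know which record is the last one, so it also works with a generator.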

TigerhawkT3
The culprit is the line: fout.write('\n'+seq_record.description)

This prepends a newline to every sequence record description line, including the first one. One solution is to change it to

fout.write(seq_record.description)

and then add another fout.write('\n') after the inner for loop, keeping the existing fout.write('\n') that follows the description. This makes the file end with a newline, but that matches the POSIX definition of a text line anyway.
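As a rough check of this variant, here is a sketch that ends every line, including the last, with a newline. The names `translate` and `write_translated_posix`, the shortened `CODONS` table, and the sample records are assumptions made for illustration; the real script would iterate over `SeqIO.parse` and use the full `CONST_CODON`.

```python
import io

# Shortened codon table, for illustration only (an assumption).
CODONS = {'atg': 'M', 'gct': 'A', 'taa': '*', 'ttt': 'F'}


def translate(dna, table=CODONS):
    """Translate a DNA string codon by codon, skipping unknown codons."""
    dna = dna.lower()
    return ''.join(
        table[dna[i:i + 3]]
        for i in range(0, len(dna), 3)
        if dna[i:i + 3] in table
    )


def write_translated_posix(records, fout):
    """Write each header and each protein as a newline-terminated line."""
    for description, dna in records:
        fout.write(description + '\n')     # header line, no leading newline
        fout.write(translate(dna) + '\n')  # sequence line, newline at the end


# Hypothetical sample records written to an in-memory file.
buffer = io.StringIO()
write_translated_posix([('BCB2141', 'ATGGCTTAA'), ('BCA2111', 'TTTATG')], buffer)
text = buffer.getvalue()
```

The resulting text starts directly with the first header and ends with exactly one newline.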

lemonhead