3

Edit 2/18: I figured out the issue. It's not the code directly, although someone has pointed out this sample I have put up is not the way I should have put it up. I apologize! The issue is the blastx results. They were not meeting the threshold set and the code was creating empty files before realizing it didn't have satisfactory results to write to the file. Thanks for your consideration.

I have been using Biopython to locally run some blastx searches, trim the DNA query to the ORF it finds, then saving the new sequence to a fasta file. Out of a batch of 450 sequences, it seems to skip 43 of them. It always skips the same 43 sequences. The skipped sequences aren't only in the beginning or end of the list, are not very short, and all sequence input is fasta format. I have checked the BLAST XML output file for several sequences that end up without writing an output sequence to a file and there are hits listed and indicated. The portion of the code that is supposed to find and write the trimmed sequence based on the blastx XML output is below.

from Bio.Blast import NCBIXML
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation

id_list = ".../ORFS.txt"
text_file = open(id_list)
list1 = text_file.read().split('\n')
list2 = []
for items in list1:
list2.append(items)
list2.pop()

access_to_orf = {}
for orf_accession in list2:
access_split = orf_accession.split(".")[0]
access_to_orf[access_split] = orf_accession

parse_in = os.path.join(out_path, out_name)
parse_name = accession + ".trim.fasta"
parse_out = os.path.join(trim_out, parse_name)
file = open(parse_out, "w")
parsed_data = ""
result_handle = open(parse_in)
blast_records = NCBIXML.parse(result_handle)
blast_record = next(blast_records)
q_record = SeqIO.read(seq_path, "fasta")
q_parse = q_record.seq
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            if (hsp.positives/alignment.length)*100 >= 95:
                fr = hsp.frame
                x, y = fr
                if x < 0:
                    dna_str = -1
                else:
                    dna_str = 1
                feature = SeqFeature(
                    FeatureLocation((hsp.query_start-1), (hsp.query_end), 
                        strand = dna_str))
                q_feature = feature.extract(q_parse)
                parsed_data += (blast_record.query + " " + str(q_feature)  +"\n")
file.write(parsed_data)
file.close()
print("Finished BLAST parse for " + ORF)

Can anyone help me figure out why it is skipping the sequences?

Edit 2/12: Added section to sample about where accession and ORF are defined.

BrianW
  • 31
  • 2
  • `print("Finished BLAST parse for " + ORF)`, but `ORF` is not defined in your sample. Is this the whole sample or are you accidently printing `None`? – Frank Bryce Feb 11 '16 at 20:44
  • ORF is defined further up in the process as part of a dictionary. I will edit the sample to reflect that. – BrianW Feb 12 '16 at 21:00
  • There are multiple problems with your script -- it is unrunnable as is, because your indentation is all wrong. Apart from all that, I don't see that you have defined "trim_out" anywhere, and you are using "out_path" (also undefined) for "parse_in" which is confusing and probably wrong. If these aren't the issues, please revise the code you include to make a [MCVE](http://stackoverflow.com/help/mcve). – iayork Feb 17 '16 at 13:45
  • 1
    Those weren't the issues. Sorry I did not realize there was MCVE. I will keep that in mind next time I ask a question. I did figure out what the issue was though and will be updating my question now to reflect that. Thanks for your help! – BrianW Feb 18 '16 at 14:32
  • Are you sure the 43 sequences aren't failing your conditional? `if (hsp.positives/alignment.length)*100 >= 95:` – Colin Anthony Oct 10 '19 at 08:27

0 Answers0