Edit 2/18: I figured out the issue. It's not the code directly, although someone has pointed out this sample I have put up is not the way I should have put it up. I apologize! The issue is the blastx results. They were not meeting the threshold set and the code was creating empty files before realizing it didn't have satisfactory results to write to the file. Thanks for your consideration.
I have been using Biopython to locally run some blastx searches, trim the DNA query to the ORF it finds, then saving the new sequence to a fasta file. Out of a batch of 450 sequences, it seems to skip 43 of them. It always skips the same 43 sequences. The skipped sequences aren't only in the beginning or end of the list, are not very short, and all sequence input is fasta format. I have checked the BLAST XML output file for several sequences that end up without writing an output sequence to a file and there are hits listed and indicated. The portion of the code that is supposed to find and write the trimmed sequence based on the blastx XML output is below.
from Bio.Blast import NCBIXML
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation
id_list = ".../ORFS.txt"
text_file = open(id_list)
list1 = text_file.read().split('\n')
list2 = []
for items in list1:
list2.append(items)
list2.pop()
access_to_orf = {}
for orf_accession in list2:
access_split = orf_accession.split(".")[0]
access_to_orf[access_split] = orf_accession
parse_in = os.path.join(out_path, out_name)
parse_name = accession + ".trim.fasta"
parse_out = os.path.join(trim_out, parse_name)
file = open(parse_out, "w")
parsed_data = ""
result_handle = open(parse_in)
blast_records = NCBIXML.parse(result_handle)
blast_record = next(blast_records)
q_record = SeqIO.read(seq_path, "fasta")
q_parse = q_record.seq
for alignment in blast_record.alignments:
for hsp in alignment.hsps:
if (hsp.positives/alignment.length)*100 >= 95:
fr = hsp.frame
x, y = fr
if x < 0:
dna_str = -1
else:
dna_str = 1
feature = SeqFeature(
FeatureLocation((hsp.query_start-1), (hsp.query_end),
strand = dna_str))
q_feature = feature.extract(q_parse)
parsed_data += (blast_record.query + " " + str(q_feature) +"\n")
file.write(parsed_data)
file.close()
print("Finished BLAST parse for " + ORF)
Can anyone help me figure out why it is skipping the sequences?
Edit 2/12: Added section to sample about where accession and ORF are defined.