0

So I've been trying to work with Biopython and I'm fairly new. My code:

fasta_string = open("C:\\Users\\saeed\\Desktop\\dna2.fasta").read()
print('1')
result_handle = NCBIWWW.qblast("blastn", "nt", fasta_string)
print('2')


blast_record = NCBIXML.read(result_handle)

len(blast_record.alignments)

E_VALUE_THRESH = 0.01
for alignment in blast_record.alignments:
    for hsp in alignment.hsps:
        if hsp.expect < E_VALUE_THRESH:
            print('*Alignment*')
            print('sequence', alignment.title)
            print('length', alignment.length)
            print(' e value', hsp.expect)
            print(hsp.query)
            print(hsp.match)
            print(hsp.sbjct)

Whenever I run this code, It prints 1 and just stops. Not stops as in exits, but keeps running and doesn't print anything else. I tried replacing the dna2.fasta file with just "myseq.fa", but that didn't seem to work either. It just said file does not exist. I want to know what I am doing wrong, and how to fix it. Any help?

SDurrani
  • 33
  • 5
  • Do you mind to show us your test sequences? Just add `print(fasta_string)` and share the output if you'd like. I guess you already know that blast takes time. A simple 50bp DNA may still take 10-15 seconds to blast. – Y. Luo Jun 04 '17 at 22:08
  • @Y.Luo When I print(fasta_string) it prints the entire dna2.fasta' file's contents, which is a lot. >gi|142022655|gb|EQ086233.1|91 marine metagenome JCVI_SCAF_1096627390048 genomic scaffold, whole genome shotgun sequence CTCGCGTTGCAGGCCGGCGTGTCGCGCAACGACGTGTGGGGCCTGACGGGCAGGGAGGATCTCGGCGGCG CCAACTATGCGGTCTTTCGGCTCGAAAGCCAGTTCCAGACCTCCGACGGCGCGCTGACCGTGCCCGGCTC CGCATTCAGTTCGCAAGCCTACGTCGGGCTCGGCGGCGACTGGGGGACCGTGACGCTCGGGCGCCAGTTC GATTTCGTCGGCGATCTGATGCCGGCTTTCGCGATCGGCGCGAACACGCCGGCCGGCCTGCTCGCGTGGG GCTTGCCGGCGAATGCGTCGGCGGGCGGTGCGCTCGACAACCGCGTGTGGGGCGTCCAGGTGAACAATGC GGTGAAGTACGTGAGCCC – SDurrani Jun 04 '17 at 22:42
  • 1
    If you have a lot of sequences to blast, it's not surprise that takes NCBI forever to respond. If you just want to try and learn how it works, I gave a short example here. Hopefully that's helpful. – Y. Luo Jun 04 '17 at 23:57

2 Answers2

0

This is what I have to do to qblast a sequence via BioPython:

import ssl  # monkey patch for BioPython 1.68 & 1.69
ssl._create_default_https_context = ssl._create_unverified_context

from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
from Bio import SeqIO

E_VALUE_THRESH = 0.01

input_file_name = "C:\\Users\\saeed\\Desktop\\dna2.fasta"

fasta_object = SeqIO.read(input_file_name, format='fasta')

result_handle = NCBIWWW.qblast("blastn", "nt", fasta_object.seq)

blast_record = NCBIXML.read(result_handle)

for alignment in blast_record.alignments:
    for hsp in alignment.hsps:
        if hsp.expect < E_VALUE_THRESH:
            print('*Alignment*')
            print('sequence', alignment.title)
            print('length', alignment.length)
            print('e value', hsp.expect)
            print(hsp.query)
            print(hsp.match)
            print(hsp.sbjct)

I'd be interested in knowing if there's a better way to handle the SSL/certificate issue.

cdlane
  • 40,441
  • 5
  • 32
  • 81
  • Could you log whatever this SSL issue is over at https://github.com/biopython/biopython/issues please? – Peter Cock Jun 05 '17 at 15:36
  • 1
    @peterjc, this turns out be be an OS X issue, not biopython, and the fix comes with Python and is described in [Mac OSX python ... certificate verify failed](https://stackoverflow.com/a/42098127/5771269) which I've tried and seems to work. – cdlane Feb 21 '18 at 02:40
0

Here is my example with multiple queries in one request.

import timeit # Not necessary; just for timing the blast request.

from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML

fasta_string = open("dna2.fasta").read()
# In "dna2.fasta":
# >test1
# CGCTCATGCTAAAACCACGGAGGAATGTTTGGCCTATTTTGGGGTGAGTG
# >test2
# GCCAAGTCTGCAGGAAGCTTTGAGTTCTGACATCCTTAATGACATGGAGT
#
# Or you can make a string for this simple example.
# fasta_string = ">test1\nCGCTCATGCTAAAACCACGGAGGAATGTTTGGCCTATTTTGGGGTGAGTG\n>test2\nGCCAAGTCTGCAGGAAGCTTTGAGTTCTGACATCCTTAATGACATGGAGT\n"
print(fasta_string)

a = timeit.default_timer() # Not necessary; just for timing the blast request.
result_handle = NCBIWWW.qblast("blastn", "nt", fasta_string)
print(timeit.default_timer() - a) # Not necessary; just for timing the blast request.
# This takes me ~ 40 sec in one test.

# Use "parse" instead of "read" because you have lots of results (i.e., multiple query sequences)
blast_records = NCBIXML.parse(result_handle)
for blast_record in blast_records:
    print(blast_record.alignments[0].hsps[0])

"cdlane" was correct. You might also want to use the Bio.SeqIO module to read in FASTA file. I'm sure you've read it but just in case, the relevant document is here: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc87

Y. Luo
  • 5,622
  • 1
  • 18
  • 25