
I am attempting to use Biopython to download all of the proteins of a list of organisms sequenced by a specific institution. I have the organism names and the BioProject accession associated with each organism; specifically, I am looking to analyze the proteins found in some recent genome sequences. I'd like to download the protein files in bulk, in the friendliest manner possible, with `efetch`. My most recent attempt at downloading all of the protein FASTA sequences for one associated organism is as follows:

  net_handle = Entrez.efetch(db="protein",
                             id=mydictionary["BioPROJECT"][i],
                             rettype="fasta")

There are roughly 3,000-4,500 proteins associated with each organism, so using esearch and trying to efetch each protein one at a time is not realistic. Plus, I'd like to have a single FASTA file for each organism that encompasses all of its proteins.

Unfortunately, when I run this line of code, I receive the following error: `urllib2.HTTPError: HTTP Error 400: Bad Request`.

It appears that, for all of the organisms I am interested in, I can't simply find their genome sequence in the Nucleotide database and download its "Protein encoding Sequences".

How may I obtain the protein sequences I want in a manner that won't overload the NCBI servers? I was hoping to replicate what I can do in NCBI's web interface: select the protein database, search for the BioProject number, and then save all of the found protein sequences into a single FASTA file (via the "Send to" drop-down menu).
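For reference, the closest thing I have found is the history-server pattern from the Biopython tutorial, sketched below. I have not verified that a BioProject accession works as a protein-database search term, so the `PRJNA12345[BioProject]` query, the e-mail address, and the output filename are all placeholders:

from Bio import Entrez

Entrez.email = "you@example.com"  # NCBI requires a contact address; placeholder

# Search the protein database for one BioProject and park the results on
# NCBI's history server instead of handling thousands of IDs directly.
search_handle = Entrez.esearch(db="protein",
                               term="PRJNA12345[BioProject]",
                               usehistory="y")
search_results = Entrez.read(search_handle)
search_handle.close()

webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]

# A single efetch against the history server streams every hit as FASTA,
# which is what "Send to > File" does in the browser.
fetch_handle = Entrez.efetch(db="protein", rettype="fasta", retmode="text",
                             webenv=webenv, query_key=query_key)
with open("organism_proteins.fasta", "w") as out:
    out.write(fetch_handle.read())
fetch_handle.close()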

    @redvyper, please provide some info about `mydictionary` contents. The call to efetch is valid and retrieves valid values (e.g. using "12345" as `id` you get the protein at http://www.ncbi.nlm.nih.gov/protein/CAA44029.1). – xbello Jul 14 '14 at 15:20

2 Answers


Try downloading the sequences from PATRIC's FTP site, which is a gold mine: first, it is much better organized, and second, the data are a lot cleaner than NCBI's. PATRIC is backed by the NIH, by the way.

PATRIC contains some 15,000+ genomes and provides, in separate files, their DNA, their proteins, the DNA of the protein-coding regions, EC numbers, pathways, and GenBank records. Super convenient. Have a look for yourself:

ftp://ftp.patricbrc.org/patric2.

I suggest you download all of the desired files from all of the organisms first, and then pick out the ones you need once you have them on your hard drive. The following Python script downloads the EC number annotation files provided by PATRIC in one go (if you are behind a proxy, configure it in the commented-out section):

from ftplib import FTP
import sys

####### if you are behind a proxy, connect through it instead:
# site = FTP('1.1.1.1')  # fill in your proxy IP here
# site.set_debuglevel(1)
# msg = site.login('anonymous@ftp.patricbrc.org')

# anonymous login to the PATRIC FTP server
site = FTP("ftp.patricbrc.org")
site.login()
site.cwd('/patric2/current_release/ec/')

# collect the directory listing, one line per entry
bacteria_list = []
site.retrlines('LIST', bacteria_list.append)

# first command-line argument is the output directory
output = sys.argv[1]
if not output.endswith("/"):
    output += "/"

print "bacteria_list: ", len(bacteria_list)

for c in bacteria_list:
    # the path is the last whitespace-separated field of the listing line
    path_name = c.strip().split()[-1]

    if "PATRIC.ec" in path_name:
        filename = path_name.split("/")[-1]
        # fetch in binary mode and make sure the file is closed afterwards
        with open(output + filename, 'wb') as out_file:
            site.retrbinary('RETR ' + path_name, out_file.write)
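Run the script with the output directory as its only argument, e.g. `python download_patric_ec.py ./patric_ec/` (the script name here is just an example).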
dgg32

While I have no experience with Python, let alone Biopython, a quick Google search turned up a couple of things for you to look at:

urllib2 HTTP Error 400: Bad Request

urllib2 gives HTTP Error 400: Bad Request for certain urls, works for others

Mike Z
  • Yes, it would appear that the error is coming from the efetch function. If I change the search term to something that only nets me a single find, it works fine. This leads me to believe that I am using the efetch function incorrectly or the NCBI servers are repelling me in some way when I make certain searches. – redvyper Sep 13 '13 at 20:22
  • I found something else. You may be hitting their servers too often; see if you can add a count/breakpoint to find out if/when you are restricted. http://stackoverflow.com/questions/14827131/urllib2-httperror-python?rq=1 – Mike Z Sep 13 '13 at 20:28
  • It currently works when I try using efetch for something "simple" like looking up a single protein. Biopython fortunately has some built-in measures for each of its functions that prevent you from hitting their servers too often. However, from my understanding, it'll stop you if the "job is too large," i.e. trying to request 10,000 files! I have ~100 organisms I want to look into. My code has a built-in wait time; after each organism, it analyzes all of the proteins before downloading another organism's proteins. – redvyper Sep 13 '13 at 20:33
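In case it helps anyone with the same throttling worry: below is a minimal batching sketch, assuming `webenv`, `query_key`, and the total hit `count` came from an earlier esearch with usehistory="y" (as sketched in the question). The `fetch_fasta_in_batches` helper, the batch size of 500, and the one-second sleep are arbitrary safety margins, not NCBI requirements; Biopython itself already waits between Entrez calls once `Entrez.email` is set.

import time
from Bio import Entrez

Entrez.email = "you@example.com"  # placeholder; NCBI asks for a real address

def fetch_fasta_in_batches(webenv, query_key, count, out_path, batch_size=500):
    # Pull `count` records from the history server in modest chunks, so that
    # no single request is "too large" and failures only cost one batch.
    with open(out_path, "w") as out:
        for start in range(0, count, batch_size):
            handle = Entrez.efetch(db="protein", rettype="fasta",
                                   retmode="text", retstart=start,
                                   retmax=batch_size, webenv=webenv,
                                   query_key=query_key)
            out.write(handle.read())
            handle.close()
            time.sleep(1)  # extra politeness on top of Biopython's throttle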