I am attempting to use Biopython to download all of the proteins of a list of organisms sequenced by a specific institution. I have the organism names and the BioProject accession associated with each organism; specifically, I am looking to analyze the proteins found in some recent genome sequences. I'd like to download the protein files in bulk, in the friendliest manner possible, with efetch. My most recent attempt at downloading all of the protein FASTA sequences for a given organism is as follows:
from Bio import Entrez

Entrez.email = "my.email@example.com"  # NCBI requires a contact address

net_handle = Entrez.efetch(db="protein",
                           id=mydictionary["BioPROJECT"][i],
                           rettype="fasta")
There are roughly 3000-4500 proteins associated with each organism, so using esearch and trying to efetch each protein one at a time is not realistic. Plus, I'd like to have a single FASTA file for each organism that encompasses all of its proteins.
Unfortunately, when I run this code, I receive the following error:

urllib2.HTTPError: HTTP Error 400: Bad Request
It appears that, for all of the organisms I am interested in, I can't simply find their genome sequence in the Nucleotide database and download the "Protein encoding Sequences" from there.
How may I obtain these protein sequences in a manner that won't overload the NCBI servers? I was hoping I could replicate what I can do in NCBI's web interface: select the protein database, search for the BioProject number, and then save all of the resulting protein sequences into a single FASTA file (under the "Send to" drop-down menu).
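In case it helps frame the question, here is a minimal sketch of how I imagine that web workflow might translate to Biopython: search the protein database for the BioProject accession with `usehistory="y"`, then pull the hits down in batches via the history server (`WebEnv`/`query_key`) rather than one efetch per protein. The function names, batch size, and `[BioProject]` field qualifier are my assumptions, not something I have confirmed works for these records.

```python
# Sketch only: assumes the BioProject accession is searchable in the
# protein database via the [BioProject] field qualifier.
from Bio import Entrez

Entrez.email = "my.email@example.com"  # placeholder; NCBI requires a real address


def bioproject_term(accession):
    # Build the same query the web interface runs when you search
    # the protein database for a BioProject number (assumed syntax).
    return f"{accession}[BioProject]"


def fetch_proteins(accession, out_path):
    # esearch with usehistory="y" stores the full hit list on NCBI's
    # history server instead of returning thousands of IDs to us.
    handle = Entrez.esearch(db="protein",
                            term=bioproject_term(accession),
                            usehistory="y")
    record = Entrez.read(handle)
    handle.close()
    count = int(record["Count"])
    webenv, query_key = record["WebEnv"], record["QueryKey"]

    # One batched efetch per 500 records keeps the request count low,
    # which should be the server-friendly pattern.
    batch = 500
    with open(out_path, "w") as out:
        for start in range(0, count, batch):
            fetch = Entrez.efetch(db="protein", rettype="fasta",
                                  retmode="text", retstart=start,
                                  retmax=batch, webenv=webenv,
                                  query_key=query_key)
            out.write(fetch.read())
            fetch.close()
    return count
```

Is this batched history-server approach the right way to go, or is there a friendlier route?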