Retrieving whole genome genbank files for some organism using Biojava or Biopython

Question

Does anyone have an idea how to automatically search for and parse gbk files from the NCBI FTP site using either Biopython or BioJava? I have searched for such utilities in BioJava and have not found any. I have also tried Biopython, and here is my code:

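from Bio import Entrez

# NCBI asks for a contact address and a tool name with every request
Entrez.email = "test@yahoo.com"
Entrez.tool = "MyLocalScript"

# Search the nucleotide database for every record from this organism
handle = Entrez.esearch(db="nucleotide", term="Mycobacterium avium[Orgn]")
record = Entrez.read(handle)
handle.close()

print(record)
print(record["Count"])   # total number of matching records
id_L = record["IdList"]  # IDs returned in this batch
print(id_L)
print(len(id_L))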

However, there are only 3 Mycobacterium avium genomes (complete, fully annotated whole-genome sequences), yet the count I get back is 59897.

Can anyone tell me how to perform the search in either BioJava or Biopython? Otherwise I will have to automate this process from scratch.

Thank you.

Cross-posted on Biostars: http://www.biostars.org/p/95289/ – Pierre Mar 14 '14 at 19:02

score 1 · Answer 1 · answered Mar 16 '14 at 09:15

The way we do it is by passing the ID explicitly to the efetch interface:

Entrez.efetch(db="nucleotide", id=<ACCESSION ID HERE>, rettype="gb", retmode="text")

A search term like the one you used returns too many matches, and you end up downloading all of them. For example, that term matches 48 different BioProjects:

http://www.ncbi.nlm.nih.gov/bioproject/?term=Mycobacterium+avium

From experience, the most accurate way to get what you want is to use the ACCESSION ID.
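As a rough illustration of fetching by accession, here is a minimal sketch that calls the E-utilities efetch endpoint directly with the standard library, using the same db, rettype and retmode values as the Entrez.efetch call above. The helper names (efetch_url, looks_like_genbank, download_genbank), the output path and the LOCUS check are assumptions made for this example, not part of Biopython.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Public E-utilities endpoint for EFetch
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, email, tool="MyLocalScript"):
    """Build an EFetch URL for one nucleotide record in GenBank flat-file format."""
    params = {
        "db": "nucleotide",
        "id": accession,
        "rettype": "gb",
        "retmode": "text",
        "email": email,
        "tool": tool,
    }
    return EFETCH_URL + "?" + urlencode(params)


def looks_like_genbank(text):
    """Cheap sanity check (an assumption of this sketch): flat files start with LOCUS."""
    return text.lstrip().startswith("LOCUS")


def download_genbank(accession, email, out_path):
    """Fetch one record over the network and save it; raises if the reply is not a flat file."""
    with urlopen(efetch_url(accession, email)) as response:
        text = response.read().decode("utf-8")
    if not looks_like_genbank(text):
        raise ValueError("Unexpected response for " + accession)
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(text)
```

Calling download_genbank with an accession, a contact address and an output path such as "genome.gbk" would save one record; the saved file can then be parsed with whatever GenBank parser you prefer.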

Spade


The thing is I only have the genus and species of the bacteria. I do not have the accession id, hence the search by name. – user3419642 Mar 17 '14 at 05:56
Then you have to deal with the myriad experiments that people have submitted to NCBI over the years. The shape of things at GenBank changes continuously, and what counts as correct is relative for a lot of people. There is no systematic curation process that puts one dataset above another. This makes it hard for bots such as yours to find the right dataset. – Spade Mar 17 '14 at 17:33
Thanks. I think I will have to automate the process using the ftp service provided by NCBI. – user3419642 Mar 19 '14 at 06:36

score 0 · Answer 2 · answered Mar 22 '14 at 13:35

If you want to search NCBI for this information dynamically and automatically, you can search by name with the ESearch interface, which is used in much the same way as EFetch. This gives you a list of record IDs, which you can then pass to EFetch to retrieve the nucleotide records (or any other information you need).

http://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_ESearch_

The Entrez E-Utilities are very flexible, although it is true that you will need to filter the results to only obtain the data you need.
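One way to do that filtering is to narrow the search term itself before fetching anything. The sketch below builds such a term and reads the ID list from the JSON form of the ESearch response. The particular filters ("complete genome" in the title, RefSeq only) are an assumption about what counts as a whole, annotated genome, and the function names are made up for this example; the hits still need checking by hand.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public E-utilities endpoint for ESearch
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(organism):
    """Restrict an organism search to RefSeq records titled 'complete genome' (assumed filters)."""
    return f'"{organism}"[Organism] AND "complete genome"[Title] AND refseq[filter]'


def esearch_url(term, retmax=100):
    """Build an ESearch URL against the nucleotide database, asking for JSON output."""
    params = {"db": "nucleotide", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


def parse_ids(payload):
    """Return (total hit count, list of IDs) from an ESearch JSON response body."""
    result = json.loads(payload)["esearchresult"]
    return int(result["count"]), result["idlist"]


def search_genomes(organism):
    """Run the narrowed search over the network and return (count, ids)."""
    with urlopen(esearch_url(build_term(organism))) as response:
        return parse_ids(response.read().decode("utf-8"))
```

The IDs returned by search_genomes("Mycobacterium avium") can be passed to EFetch one at a time or as a comma-separated list.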

However, if you are going to do further analysis with this data, and you need neither the very latest version of the sequences nor dynamic fetching of different types of data, it may be better to just download what you need from the FTP site and process and filter it locally. That might be faster than querying Entrez (which is, in my opinion, a little slow when queried in batch).
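The FTP route could look roughly like the sketch below, which uses the standard-library ftplib module. The directory path and the Genus_species_strain folder naming are assumptions about how the NCBI FTP site was laid out, so check them against the live server before relying on this; the function names are invented for the example.

```python
import os
from ftplib import FTP

FTP_HOST = "ftp.ncbi.nlm.nih.gov"
BACTERIA_DIR = "/genomes/Bacteria"  # assumed location of the per-organism folders


def matching_dirs(names, genus, species):
    """Pick folder names of the assumed form Genus_species[_strain...], ignoring case."""
    prefix = f"{genus}_{species}".lower()
    return [n for n in names
            if n.lower() == prefix or n.lower().startswith(prefix + "_")]


def download_gbk(genus, species, dest="."):
    """Download every .gbk file from each matching organism folder into dest."""
    saved = []
    with FTP(FTP_HOST) as ftp:
        ftp.login()  # anonymous login
        ftp.cwd(BACTERIA_DIR)
        for folder in matching_dirs(ftp.nlst(), genus, species):
            ftp.cwd(f"{BACTERIA_DIR}/{folder}")
            for name in ftp.nlst():
                if name.endswith(".gbk"):
                    # Prefix with the folder name so files from different strains do not collide
                    path = os.path.join(dest, f"{folder}_{name}")
                    with open(path, "wb") as out:
                        ftp.retrbinary(f"RETR {name}", out.write)
                    saved.append(path)
    return saved
```

Calling download_gbk("Mycobacterium", "avium") would return the list of local files it wrote, one per chromosome or plasmid found in each strain folder.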

Thanks. What I have done is automate the search on the FTP service with a program I wrote. I enter the genus and species, and the program looks for them in the Bacteria folder of the FTP directory and downloads the gbk file for me. Entrez returns too many results, so it does not suit my purpose. Thanks again. – user3419642 Mar 25 '14 at 09:22
You're welcome. You got to a better solution yourself anyway :) — cnluzon, Mar 27 '14 at 21:33
