
I need to download only complete genome sequences from NCBI, in GenBank (full) format. I am interested in records labelled 'complete genome', not 'whole genome'.

My script:

from Bio import Entrez
Entrez.email = "asiakXX@wp.pl"
gatunek='Escherichia[ORGN]'
handle = Entrez.esearch(db='nucleotide',
     term=gatunek, property='complete genome' )#title='complete genome[title]')
result = Entrez.read(handle)

As a result I get only small fragments of genomes, about 484 bp in size:

LOCUS       NZ_KE350773              484 bp    DNA     linear   CON 23-AUG-2013
DEFINITION  Escherichia coli E1777 genomic scaffold scaffold9_G, whole genome
       shotgun sequence.

I know how to do it manually on the NCBI website, but that is very time-consuming. The query I use there is:

escherichia[orgn] AND complete genome[title]

As a result I get multiple genomes, each around 5,154,862 bp in size, and this is what I need to reproduce via Entrez.esearch.

user2662581

3 Answers


You've done the hard part and worked out the query,

escherichia[orgn] AND complete genome[title]

So use that as the search query via Biopython as well!

from Bio import Entrez
Entrez.email = "asiakXX@wp.pl"
search_term = "escherichia[orgn] AND complete genome[title]"
handle = Entrez.esearch(db='nucleotide', term=search_term)
result = Entrez.read(handle)
handle.close()
print(result['Count'])  # total number of matching records

Currently that gives me 140 results, starting with 545778205, which is the same as the website: http://www.ncbi.nlm.nih.gov/nuccore/?term=escherichia%5Borgn%5D+AND+complete+genome%5Btitle%5D
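One caveat when going from the count to the actual IDs: esearch returns only the first 20 IDs by default, so with 140 hits the search has to be paged with retstart and retmax. The following is a minimal sketch of that, not a definitive implementation. The function names batch_ranges and fetch_all_ids and the batch size of 100 are my own illustrative choices, not part of Biopython.

```python
def batch_ranges(total, batch_size):
    """Yield (retstart, retmax) pairs that together cover `total` hits."""
    for start in range(0, total, batch_size):
        yield start, min(batch_size, total - start)


def fetch_all_ids(term, email, batch_size=100):
    """Collect every nucleotide ID matching `term` (hypothetical helper)."""
    # Imported here so batch_ranges can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # always tell NCBI who you are

    # First request: ask only for the hit count (retmax=0 returns no IDs).
    handle = Entrez.esearch(db="nucleotide", term=term, retmax=0)
    total = int(Entrez.read(handle)["Count"])
    handle.close()

    # Then page through the results one batch at a time.
    ids = []
    for start, size in batch_ranges(total, batch_size):
        handle = Entrez.esearch(db="nucleotide", term=term,
                                retstart=start, retmax=size)
        ids.extend(Entrez.read(handle)["IdList"])
        handle.close()
    return ids
```

Calling fetch_all_ids("escherichia[orgn] AND complete genome[title]", "you@example.com") should then return the whole ID list rather than just the first page; the email address here is a placeholder.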

Shred
Peter Cock

Your question is clear, but the full answer is long. The code below writes a .fasta file for each of the E. coli genome sequences you want, and only those listed as "Complete Genomes" in NCBI.

You will see there are only six complete E. coli reference genomes in NCBI (http://www.ncbi.nlm.nih.gov/genome/167):


To help you, here are the Genbank/Refseq links to their genomes:

  1. http://www.ncbi.nlm.nih.gov/nuccore/NC_000913.3

  2. http://www.ncbi.nlm.nih.gov/nuccore/NC_002695.1

  3. http://www.ncbi.nlm.nih.gov/nuccore/NC_011750.1

  4. http://www.ncbi.nlm.nih.gov/nuccore/NC_011751.1

  5. http://www.ncbi.nlm.nih.gov/nuccore/NC_017634.1

  6. http://www.ncbi.nlm.nih.gov/nuccore/NC_018658.1

Here is my code for parsing the complete genome sequences into .fasta files:

# Imports
from Bio import Entrez
from Bio import SeqIO

#############################
# Retrieve NCBI Data Online #
#############################

Entrez.email     = "asiak@wp.pl"             # Always tell NCBI who you are
genomeAccessions = ['NC_000913', 'NC_002695', 'NC_011750', 'NC_011751', 'NC_017634', 'NC_018658']
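search           = " OR ".join(acc + "[accn]" for acc in genomeAccessions)  # OR, so any one accession matches
handle           = Entrez.esearch(db="nucleotide", term=search, retmode="xml")
genomeIds        = Entrez.read(handle)['IdList']  # six hits fit within the default limit of 20 IDs
handle.close()
records          = Entrez.efetch(db="nucleotide", id=genomeIds, rettype="gb", retmode="text")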

###############################
# Generate Genome Fasta files #
###############################

sequences   = []  # store your sequences in a list
headers     = []  # store genome names in a list (db_xref ids)

for i, genome in enumerate(SeqIO.parse(records, "genbank")):   # parse one GenBank record at a time

    with open("genBankRecord_"+str(i)+".gb", "w") as file_out:  # store each genome's .gb in a separate file
        SeqIO.write(genome, file_out, "genbank")

    header   = genome.features[0].qualifiers['db_xref'][0]      # name the genome using its db_xref ID
    sequence = str(genome.seq)                                  # obtain genome sequence as a plain string

    headers.append('>'+header)  # store genome name in list
    sequences.append(sequence)  # store sequence in list

    with open("genome"+str(i)+".fasta", "w") as fasta_out:      # store each genome's .fasta in a separate file
        fasta_out.write('>'+header+'\n')   # >header line, followed by:
        fasta_out.write(sequence+'\n')     # the sequence
records.close()

Let me know how it goes! Andy
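As a side note on the FASTA-writing step: a FASTA record is a '>' header line followed by the sequence, which is conventionally wrapped to a fixed line width. A small sketch of that formatting is below. The helper name format_fasta and the default width of 70 are assumptions of mine for illustration; Biopython's SeqIO.write(record, handle, "fasta") is an alternative when a SeqRecord is at hand.

```python
def format_fasta(header, sequence, width=70):
    """Return one FASTA record as text: a '>' header line, then wrapped sequence lines."""
    lines = [">" + header]
    # Slice the sequence into chunks of at most `width` characters.
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"
```

The returned string can be written directly to an open file, for example fasta_out.write(format_fasta(header, sequence)).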

hello_there_andy

This works for me...

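import os
from Bio import Entrez

Entrez.email = "you@example.com"  # placeholder address: replace with your own

search_term = 'escherichia coli[orgn] AND complete genome[title]'
handle = Entrez.esearch(db='nucleotide', term=search_term)  # returns at most 20 IDs unless retmax is set
genome_ids = Entrez.read(handle)['IdList']
handle.close()

os.makedirs('generated', exist_ok=True)  # the output directory must exist before writing

for genome_id in genome_ids:
    record = Entrez.efetch(db="nucleotide", id=genome_id, rettype="gb", retmode="text")

    filename = 'generated/genBankRecord_{}.gb'.format(genome_id)
    print('Writing:{}'.format(filename))
    with open(filename, 'w') as f:
        f.write(record.read())
    record.close()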
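The question asks specifically for the "GenBank (full)" view. As far as I know, the E-utilities equivalent is rettype="gbwithparts", which returns the flat file with the full sequence included. Below is a hedged sketch of the download loop using it. The function names record_path and download_full_genbank and the "generated" output directory are illustrative assumptions, not an established API.

```python
import os


def record_path(out_dir, genome_id):
    """Build the output path for one GenBank record (hypothetical naming scheme)."""
    return os.path.join(out_dir, "genBankRecord_{}.gb".format(genome_id))


def download_full_genbank(genome_ids, email, out_dir="generated"):
    """Save each record in `genome_ids` as a GenBank (full) flat file."""
    # Imported here so record_path can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # always tell NCBI who you are
    os.makedirs(out_dir, exist_ok=True)

    for genome_id in genome_ids:
        # gbwithparts: GenBank flat file with the full sequence included.
        handle = Entrez.efetch(db="nucleotide", id=genome_id,
                               rettype="gbwithparts", retmode="text")
        with open(record_path(out_dir, genome_id), "w") as out:
            out.write(handle.read())
        handle.close()
```

With the ID list from the search above, download_full_genbank(genome_ids, "you@example.com") would write one .gb file per genome; the email address is a placeholder.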
schryer