1

I am trying to download sequence data from E. coli samples within the state of Washington - it's about 1283 sequences, which I know is a lot. The problem that I am running into is that entrez_search and/or entrez_fetch seem to be pulling the wrong data. For example, the following R code does pull 1283 IDs, but when I use entrez_fetch on those IDs, the sequence data I get is from chickens and corn and things that are not E. coli:

search <- entrez_search(db = "biosample", 
                        term = "Escherichia coli[Organism] AND geo_loc_name=USA:WA[attr]",
                        retmax = 9999, use_history = T)

Similarly, I tried pulling the sequence from one sample manually as a test. When I search for the accession number SAMN30954130 on the NCBI website, I see metadata for an E. coli sample. When I use this code, I see metadata for a chicken:

search <- entrez_search(db = "biosample", 
                        term = "SAMN30954130[ACCN]",
                        retmax = 9999, use_history = T)
fetch_test <- entrez_fetch(db = "nucleotide",
                           id = search$ids,
                           rettype = "xml")
fetch_list <- xmlToList(fetch_test)
  • UIDs are unique to a database, not unique across databases. You are using a Biosample UID to query the Nucleotide database. In your example, the Biosample UID returned from the search is 30954130, but the corresponding Nucleotide UID is 2307876014. – neilfws Jan 27 '23 at 04:24

1 Answers1

1

The issue here is that you are using a Biosample UID to query the Nucleotide database. However, the UID is then interpreted as a Nucleotide UID, so you get a sequence record unrelated to your original Biosample query.

What you need to use in this situation is entrez_link, which uses a UID to link records between two databases.

For example, your Biosample accession SAMN30954130 has the Biosample UID 30954130. You link that to Nucleotide like this:

nuc_links <- entrez_link(dbfrom='biosample', id=30954130, db='nuccore')

And you can get the corresponding Nucleotide UID(s) like this:

nuc_links$links$biosample_nuccore

[1] "2307876014"

And then:

fetch_test <- entrez_fetch(db = "nucleotide",
                           id = 2307876014,
                           rettype = "xml")

This is covered in the section "Finding cross-references" of the rentrez tutorial.

neilfws
  • 32,751
  • 5
  • 50
  • 63
  • Thank you, this is helpful! This may be a dumb question, but then how do I get the actual sequence data? When I run that code, it looks like it gives me more metadata about the sequence, but I don't see the sequence itself. – notasfarwest Jan 27 '23 at 05:09
  • As it says at [the NCBI page for that nucleotide UID](https://www.ncbi.nlm.nih.gov/nuccore/2307876014): "This entry is the master record for a whole genome shotgun sequencing project and contains no sequence data." The contigs associated with the record [are here](https://www.ncbi.nlm.nih.gov/Traces/wgs/ABIKJS01?display=contigs) so perhaps you can figure out how to get to those accessions from the WGS XML record. – neilfws Jan 27 '23 at 05:15