I am trying to download sequence data for E. coli samples collected in the state of Washington. That is about 1283 sequences, which I know is a lot. The problem I am running into is that entrez_search and/or entrez_fetch (from the rentrez package) seem to be pulling the wrong data. For example, the following R code does return 1283 IDs, but when I call entrez_fetch on those IDs, the sequence data I get back is from chickens, corn, and other organisms that are not E. coli:
library(rentrez)
search <- entrez_search(db = "biosample",
                        term = "Escherichia coli[Organism] AND geo_loc_name=USA:WA[attr]",
                        retmax = 9999, use_history = TRUE)
Similarly, I tried pulling the sequence for a single sample manually as a test. When I search for the accession number SAMN30954130 on the NCBI website, I see metadata for an E. coli sample. When I run the code below, I get metadata for a chicken instead:
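library(XML)
# Search the BioSample database for the accession
search <- entrez_search(db = "biosample",
                        term = "SAMN30954130[ACCN]",
                        retmax = 9999, use_history = TRUE)
# Fetch records for the returned IDs from the nucleotide database
fetch_test <- entrez_fetch(db = "nucleotide",
                           id = search$ids,
                           rettype = "xml")
# Convert the returned XML into an R list
fetch_list <- xmlToList(fetch_test)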
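One thing I suspect but have not confirmed: Entrez UIDs are specific to the database that was searched, so the IDs in search$ids are BioSample UIDs, and passing them to db = "nucleotide" may simply retrieve unrelated nucleotide records that happen to have the same numbers. Below is a rough sketch of what I think the alternative would look like, using entrez_link to map BioSample UIDs to nucleotide UIDs before fetching. The helper name fetch_linked_nucleotide and the max_records cap are my own inventions, and the link name biosample_nuccore is an assumption on my part (names(lk$links) should show which link names actually come back).

```r
# Sketch only, not a confirmed fix. Assumes the rentrez package is installed.
# `fetch_linked_nucleotide` is a hypothetical helper, not part of rentrez.
fetch_linked_nucleotide <- function(accession, max_records = 5) {
  # Search BioSample; the IDs returned here are BioSample UIDs
  bs <- rentrez::entrez_search(db = "biosample",
                               term = paste0(accession, "[ACCN]"))
  if (length(bs$ids) == 0) return(NULL)

  # Ask Entrez which nucleotide records are linked to those BioSample UIDs
  lk <- rentrez::entrez_link(dbfrom = "biosample", id = bs$ids, db = "nuccore")

  # Assumption: the link set is named "biosample_nuccore"
  nuc_ids <- lk$links$biosample_nuccore
  if (is.null(nuc_ids) || length(nuc_ids) == 0) return(NULL)

  # Fetch only the first few linked records, as FASTA text
  rentrez::entrez_fetch(db = "nuccore",
                        id = head(nuc_ids, max_records),
                        rettype = "fasta")
}
```

I would call it as fetch_linked_nucleotide("SAMN30954130") and check whether the FASTA headers mention E. coli. If a sample has no linked nucleotide records (for example, if it only has raw reads), I would expect this to return NULL.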