Genbank query (package seqinr): searching in sequence description

Question

I am using the function query() of package seqinr to download myoglobin DNA sequences from Genbank. E.g.:

query("myoglobins","K=myoglobin AND SP=Turdus merula")

Unfortunately, for many of the species I'm looking for I don't get any sequences at all (or, for this species, only a very short one), even though I find sequences when I search manually on the website. The reason is that the query looks for "myoglobin" only in the keywords, and that field is often empty. Often the protein type is only given in the name ("definition" on Genbank), but I have no idea how to search that field. The help page for query() doesn't seem to offer an option for this in its details, a "generic search" without any "K=" doesn't work, and I haven't found anything by googling.

I'd be happy about any links, explanations and help. Thank you! :)

score 2 · Answer 1 · answered Jan 08 '14 at 17:18

There is a complete manual for the seqinr package, which describes the query language in more depth in chapter 5 (available at http://seqinr.r-forge.r-project.org/seqinr_2_0-1.pdf). I was trying to do a similar query, and the description for many of the genes/CDS is blank, so they don't come up when searching with the K= option. One alternative would be to search for the organism alone, then match gene names in the individual annotations and pull out the accession numbers, which you could then use to re-query the database for your sequences.

This would pull out the annotation for the first gene:

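library(seqinr)
choosebank("emblTP")
# Recent seqinr versions return the result instead of creating it, so assign it
ACexample <- query("ACexample", "sp=Turdus merula")
getName(ACexample$req[[1]])
annotations <- getAnnot(ACexample$req[[1]])
cat(annotations, sep = "\n")
closebank()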

I think this would be a pretty time-consuming way to tackle the problem, but there doesn't seem to be an efficient way of searching the annotations directly. I'd be interested in any solutions you might come up with.
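The approach described above (query by species, then filter on the annotation text) could be sketched roughly as follows. This is only a sketch under some assumptions: annotation_mentions() and find_by_annotation() are hypothetical helper names, not part of seqinr; the bank name "genbank" and the cutoff of 50 annotation lines are guesses that may need adjusting; and the loop makes one server request per entry, so it will be slow for species with many sequences.

```r
# Hypothetical helper: TRUE if any annotation line mentions `term`
# (case-insensitive regular-expression match).
annotation_mentions <- function(annot_lines, term) {
  any(grepl(term, annot_lines, ignore.case = TRUE))
}

# Hypothetical helper: keep only the entries whose annotations mention `term`.
# `req` is the $req list of a seqinr query result; `nbl` is the number of
# annotation lines fetched per entry (50 is an assumed, arbitrary cutoff).
find_by_annotation <- function(req, term, nbl = 50) {
  hits <- vapply(req, function(entry) {
    annotation_mentions(seqinr::getAnnot(entry, nbl = nbl), term)
  }, logical(1))
  req[hits]
}

# Example session; needs the seqinr package and a reachable ACNUC server,
# so it is only run in an interactive session.
if (interactive()) {
  library(seqinr)
  choosebank("genbank")  # assumption: the full GenBank bank is the one wanted
  birds <- query("birds", "SP=Turdus merula")
  myo <- find_by_annotation(birds$req, "myoglobin")
  print(vapply(myo, getName, character(1)))  # names of the matching entries
  myo_seqs <- lapply(myo, getSequence)       # the sequences themselves
  closebank()
}
```

Because the filter works on the raw annotation lines, it also catches entries where "myoglobin" appears only in the definition line and not among the keywords.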
