
I apologize if this has already been asked using terminology I don't recognize, but I couldn't find an existing question.

I am using the function comm2sci in the library taxize to search for the scientific names for a database of over 120,000 rows of common names. Here is a subset of 10:

commnames <- c("WESTERN CAPERCAILLIE", "AARDVARK", "AARDWOLF", "ABACO ISLAND BOA",
               "ABBOTT'S DAY GECKO", "ABDIM'S STORK", "ABRONIA GRAMINEA",
               "ABYSSINIAN BLUE WINGED GOOSE",
               "ABYSSINIAN CAT", "ABYSSINIAN GROUND HORNBILL")

When searching the NCBI database with this function, it asks for user input if the common name is generic and not species-specific. For example, the following call asks for clarification for "AARDVARK": enter '1' or '2' to pick a match, or press 'return' for 'NA'.

install.packages("taxize")
library(taxize)
ncbioutput <- comm2sci(commnames, db = "ncbi")  # querying the NCBI database

Because of this, I cannot rely on this function to find the names of the 120,000 species without sitting and pressing 'return' every few minutes. I know this question sounds taxize-specific, but I've had this situation in the past with other functions as well. My question is: is there a general way to place the comm2sci call in a conditional statement that will return a specific value when user input is prompted? Or otherwise write a function that will supply some input when prompted?

All searches related to this tell me how to ask for user input but not how to override user queries. These are two of the question threads I've found, but I can't seem to apply them to my situation: Make R wait for console input?, Switch R script from non-interactive to interactive

I hope this was clear. Thank you very much for your time!

sckott
E Lundgren
  • I just ran your code, and I was only prompted to enter either `1` or `2` when there was more than one UID for taxon `AARDVARK` - is this what you want to automate?? – Evan Friedland Jul 21 '17 at 03:34
  • Yes exactly, with 120,000 observations there's no way to sit and enter 1 or 2 with each ambiguous common name. Is there a way to automate the entry? – E Lundgren Jul 21 '17 at 13:42
  • Have you tried emailing the author of the function? I see an email found at `?comm2sci` which may lead to a simple solution. – Evan Friedland Jul 21 '17 at 13:50
  • Yes, I suppose that's an option. I will email him. I've had this situation with other functions and so thought there might be a general strategy to deal with it. Thanks for the tip Evan – E Lundgren Jul 21 '17 at 14:34

1 Answer


The get_* functions, which comm2sci uses internally, all ask for user input by default when there is more than one match. However, each of those functions has a sister function with a trailing underscore, e.g., get_uid_, that does not prompt for input and returns all the data. You can use that to get all the data, then process it however you like.

Made some changes to comm2sci, so update first: devtools::install_github("ropensci/taxize")

Here's an example.

library(taxize)
commnames <- c("WESTERN CAPERCAILLIE", "AARDVARK", "AARDWOLF", "ABACO ISLAND BOA", 
               "ABBOTT'S DAY GECKO", "ABDIM'S STORK", "ABRONIA GRAMINEA", 
               "ABYSSINIAN BLUE WINGED GOOSE", 
               "ABYSSINIAN CAT", "ABYSSINIAN GROUND HORNBILL")

Then use get_uid_ to get all data

ids <- get_uid_(commnames)

Process the results in ids as you like. Here, for brevity, we'll just grab the first row of each:

ids <- lapply(ids, function(z) z[1,])

Then grab the uid's out

ids <- as.uid(unname(vapply(ids, "[[", "", "uid")), check = FALSE)

And pass to comm2sci

comm2sci(ids)

$`100830`
[1] "Tetrao urogallus"

$`9818`
[1] "Orycteropus afer"

$`9680`
[1] "Proteles cristatus"

$`51745`
[1] "Chilabothrus exsul"

$`8565`
[1] "Gekko"

$`39789`
[1] "Ciconia abdimii"

$`278977`
[1] "Abronia graminea"

$`8865`
[1] "Cyanochen cyanopterus"

$`9685`
[1] "Felis catus"

$`153643`
[1] "Bucorvus abyssinicus"

Note that NCBI returns common names from get_uid/get_uid_, so you can just go ahead and pluck those out if you want
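
As an alternative to taking the first row, ambiguous or unmatched names can be mapped to NA. The following is a minimal sketch, not part of taxize: `pick_unambiguous()` is a hypothetical helper, and it assumes that get_uid_() returns a named list holding one data frame per query with a `uid` column, or NULL when nothing matched. The mock list stands in for a real call so the example runs offline; the uid "0000" is a made-up placeholder for a second match.

```r
# Hypothetical helper: keep a uid only when a query matched exactly one row.
# `ids` is assumed to look like the output of get_uid_(): a named list of
# data frames (with a `uid` column), or NULL for queries with no match.
pick_unambiguous <- function(ids) {
  vapply(ids, function(z) {
    if (is.null(z) || nrow(z) != 1L) {
      NA_character_            # no match, or more than one match
    } else {
      as.character(z$uid)      # exactly one match
    }
  }, character(1))
}

# Mock data shaped like the assumed get_uid_() output ("0000" is a placeholder)
mock_ids <- list(
  "AARDVARK" = data.frame(uid = c("9818", "0000"), stringsAsFactors = FALSE),
  "AARDWOLF" = data.frame(uid = "9680", stringsAsFactors = FALSE),
  "LESSER PRAARIE CHICKEN" = NULL
)

uids <- pick_unambiguous(mock_ids)

# Only the resolved entries would be handed on, e.g. to
# as.uid(unname(resolved), check = FALSE) and then comm2sci()
resolved <- uids[!is.na(uids)]
```

Keeping `uids` as a named vector preserves the link between each common name and its result, so the NA entries can be merged back into the original table afterwards.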

sckott
  • This looks great, will test it as soon as I can. But a quick question to clarify: `get_uid` / `get_uid_` are for NCBI, and there are other `get_*` functions for the other databases taxize supports? Thank you so much @sckott – E Lundgren Jul 21 '17 at 18:39
  • Yes, those are for NCBI. They call their taxonomic IDs "uids". And yes, there are `get_*` functions for 13 different data sources. – sckott Jul 21 '17 at 19:35
  • Hello @sckott, First of all, I'd upvote you if I had the reputation to do so. One more question. The code you sent chooses the first row of each object in the ids list with the lapply function. The problem is - I don't want to pretend that we know the exact species if the common name is ambiguous, so I'd rather choose 'NA' (as in the option given from the user query). When I set any object in the ids list to an empty data frame of NA values (with same column names as provided by get_uid_) the as.uid() call returns an error, which also happens with unrecognized common names. – E Lundgren Jul 22 '17 at 16:41
  • Choosing the first row is just an example of what you can do - you don't have to do that. Process however you like. As for your statement "When I set any object in the ids list to an empty data frame of NA values (with same column names as provided by get_uid_) the as.uid() call returns an error, which also happens with unrecognized common names", I don't get what you mean - probably best to open an issue in the repo: https://github.com/ropensci/taxize/issues/new – sckott Jul 24 '17 at 16:46
  • For example, if you send in a bogus or incorrectly spelled common name, `id <- get_uid_("LESSER PRAARIE CHICKEN")`, the result is NULL, and the ultimate result from `comm2sci(id)` (after `as.uid()`) is: "values must be length 1, but FUN(X[[1]]) result is length 10". I'm happy to open an issue in the repo, but I have found a workaround for myself: doing one name at a time in a for loop and testing for NULL in each case. – E Lundgren Jul 24 '17 at 17:05
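
The one-name-at-a-time workaround described in the last comment might look roughly like the sketch below. It is an illustration under stated assumptions, not the commenter's actual code: `resolve_names()` and `stub_lookup()` are hypothetical names, and the lookup function is passed in as an argument so the example runs without network access. In real use one would presumably pass `get_uid_` as `lookup`, assuming it returns a one-element list per name. The uid "0000" is a made-up placeholder.

```r
# Hypothetical wrapper: look up one name at a time and record what happened,
# so a single bad name cannot abort a long run.
resolve_names <- function(common_names, lookup) {
  n <- length(common_names)
  out <- data.frame(name = common_names,
                    uid = rep(NA_character_, n),
                    status = rep(NA_character_, n),
                    stringsAsFactors = FALSE)
  for (i in seq_along(common_names)) {
    # Capture errors (e.g. network failures) instead of stopping the loop
    res <- tryCatch(lookup(common_names[i]), error = function(e) e)
    if (inherits(res, "error")) {
      out$status[i] <- "error"
      next
    }
    # The lookup is assumed to return a one-element list (data frame or NULL)
    hit <- if (length(res) > 0) res[[1]] else NULL
    if (is.null(hit) || nrow(hit) == 0L) {
      out$status[i] <- "not_found"
    } else if (nrow(hit) > 1L) {
      out$status[i] <- "ambiguous"
    } else {
      out$uid[i] <- as.character(hit$uid)
      out$status[i] <- "ok"
    }
  }
  out
}

# Offline stand-in for a real lookup; "0000" is a placeholder uid
stub_lookup <- function(name) {
  hits <- switch(name,
    "AARDWOLF" = data.frame(uid = "9680", stringsAsFactors = FALSE),
    "AARDVARK" = data.frame(uid = c("9818", "0000"), stringsAsFactors = FALSE),
    "BROKEN"   = stop("simulated network failure"),
    NULL)
  stats::setNames(list(hits), name)
}

res <- resolve_names(
  c("AARDWOLF", "AARDVARK", "LESSER PRAARIE CHICKEN", "BROKEN"),
  stub_lookup
)
```

Recording a status per name, rather than only NA, makes it possible to retry the "error" rows later and to review the "ambiguous" rows by hand.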