A common annoyance for chemists is converting CAS registry numbers of chemical compounds (stored in a commercial database that is not readily accessible) to PubChem identifiers (openly available). PubChem supports this conversion to some extent, but only through its manual web interface, not through its official PUG REST programmatic interface.

A solution in Ruby is given here, based on the e-utilities interface: http://depth-first.com/articles/2007/09/13/hacking-pubchem-convert-cas-numbers-into-pubchem-cids-with-ruby/

Does anybody know how this would translate into R?

EDIT: based on the answer below, the most elegant solution is:

library(XML)
library(RCurl)

CAStocids=function(query) {
  xmlresponse = xmlParse( getURL(paste("http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pccompound&retmax=100&term=",query,sep="") ) )
  cids = sapply(xpathSApply(xmlresponse, "//Id"), function(n){xmlValue(n)})
  return(cids)
}

> CAStocids("64318-79-2")
[1] "6434870" "5282237"

cheers, Tom

Tom Wenseleers
  • formatting is one thing. I find it a lot easier to read and less ugly if you add linebreaks and spaces so all `gsub` and the `grep` are lined-up one under the other. – flodel Feb 04 '14 at 12:22
  • Yes that's true - I'll do - but point is that one single grep expression should suffice to extract the cid's - it would have to look for a number following "CID" or "?cid=" in webp. – Tom Wenseleers Feb 04 '14 at 12:27

2 Answers

This is how the Ruby code does it, translated to R using RCurl and XML:

> xmlresponse = xmlParse( getURL("http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pccompound&retmax=100&term=64318-79-2") )

and here's how to extract the Id nodes:

> sapply(xpathSApply(xmlresponse, "//Id"), function(n){xmlValue(n)})
 [1] "6434870" "5282237"

Wrap all that in a function:

 convertU = function(query){
    xmlresponse = xmlParse(getURL(
       paste0("http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pccompound&retmax=100&term=",query))) 
    sapply(xpathSApply(xmlresponse, "//Id"), function(n){xmlValue(n)})
 }

> convertU("64318-79-2")
[1] "6434870" "5282237"
> convertU("64318-79-1")
list()
> convertU("64318-78-2")
list()
> convertU("64313-78-2")
[1] "313"

This probably needs a check for the case where nothing is found, since the function returns an empty list() then.
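One way such a not-found check could look is sketched below. The names `extract_cids` and `convertU_safe` are hypothetical, not part of the answer; the URL is copied from `convertU` above, and returning `NA` for a miss is just one possible convention. The parsing step is split out so it can be tried on a string without a network call.

```r
library(XML)

# Hypothetical helper: pull the <Id> values out of an esearch XML response.
# Returns a character vector, or NA_character_ when no <Id> node is present.
extract_cids <- function(xmltext) {
  doc <- xmlParse(xmltext, asText = TRUE)
  ids <- xpathSApply(doc, "//Id", xmlValue)
  if (length(ids) == 0) NA_character_ else as.character(ids)
}

# Hypothetical wrapper around the same e-utilities URL used in convertU.
# Needs the RCurl package installed; only called when a lookup is made.
convertU_safe <- function(query) {
  url <- paste0(
    "http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    "?db=pccompound&retmax=100&term=", query
  )
  extract_cids(RCurl::getURL(url))
}
```

With this, a caller can test the result with `is.na()` instead of checking for an empty list.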

Spacedman

I think you should still be able to convert CAS numbers to PubChem IDs using PUG REST, by entering the CAS number where the compound name would normally go. Of course, this might be less specific if CAS numbers overlap. I haven't tested it.

An example with aspirin: https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/50-78-2/cids/JSON
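A rough R sketch of this idea follows. It assumes the jsonlite package for parsing (not mentioned in the thread) and assumes the response body has the shape `{"IdentifierList": {"CID": [...]}}`; the function names `parse_pug_cids` and `cas_to_cids_pug` are hypothetical. Like the suggestion itself, the network call has not been verified here.

```r
library(jsonlite)

# Hypothetical helper: read CIDs out of a PUG REST "cids/JSON" response body.
# Assumes the shape {"IdentifierList": {"CID": [...]}}; anything else
# (for example an error document) yields an empty character vector.
parse_pug_cids <- function(jsontext) {
  cids <- fromJSON(jsontext)$IdentifierList$CID
  if (is.null(cids)) character(0) else as.character(cids)
}

# Hypothetical wrapper: look a CAS number up through the PUG REST "name" route,
# using the URL pattern from the aspirin example above.
# Needs the RCurl package installed; only called when a lookup is made.
cas_to_cids_pug <- function(cas) {
  url <- paste0(
    "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/",
    cas, "/cids/JSON"
  )
  parse_pug_cids(RCurl::getURL(url))
}
```

Keeping the parser separate from the fetch makes it possible to check the parsing on a saved response before relying on live lookups.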

whetlake