I want to extract all the drug names (such as Lepirudin) from DrugBank. The file drugbank.xml is the downloaded DrugBank database.

require(XML)
drugbank<-  xmlParse("drugbank.xml")
tmp <- getNodeSet(drugbank, "//drug/name")

However, tmp is an empty list, and I cannot figure out what is wrong. Thank you.

Update (A reproducible example):

require(XML)
xf <- '<?xml version="1.0" encoding="UTF-8"?>
<drugbank xmlns="http://www.drugbank.ca" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.drugbank.ca http://www.drugbank.ca/docs/drugbank.xsd" version="4.3">
<drug type="biotech" created="2005-06-13" updated="2015-02-23">
  <drugbank-id primary="true">DB00001</drugbank-id>
  <drugbank-id>BIOD00024</drugbank-id>
  <drugbank-id>BTD00024</drugbank-id>
  <name>Lepirudin</name>
</drug>
<drug type="biotech" created="2005-06-13" updated="2011-07-31">
  <drugbank-id primary="true">DB00002</drugbank-id>
  <drugbank-id>BIOD00071</drugbank-id>
  <drugbank-id>BTD00071</drugbank-id>
  <name>Cetuximab</name>
</drug>
</drugbank>
'
drugbank<-  xmlParse(xf,  asText=TRUE)
tmp <- getNodeSet(drugbank, "//drug/name")
Zhilong Jia
  • You should include a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) in your question. One that doesn't involve downloading and unzipping a large XML file. – MrFlick Jul 03 '15 at 15:11
  • @MrFlick Thank you. See update, please. – Zhilong Jia Jul 03 '15 at 15:35

1 Answer

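The problem is the default namespace. XPath 1.0 has no notion of a default namespace, so an unprefixed name such as drug only matches elements that are in no namespace, while every element in this document belongs to http://www.drugbank.ca. You must bind a prefix of your choice to that URI and use it in the query. This should work for your example: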

drugbank<-  xmlParse(xf,  asText=TRUE)
ns<-c("db"="http://www.drugbank.ca")
getNodeSet(drugbank, "//db:drug/db:name", namespaces=ns)

which returns

[[1]]
<name>Lepirudin</name> 

[[2]]
<name>Cetuximab</name> 

attr(,"class")
[1] "XMLNodeSet"

If you just want the names as a character vector, you can do

xpathSApply(xmlRoot(drugbank), "//db:drug/db:name", xmlValue, namespaces=ns)
# [1] "Lepirudin" "Cetuximab"
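
If you would rather not declare a prefix at all, one alternative is to match on the element's local name, which ignores whatever namespace it sits in. The following is only a sketch against a trimmed copy of the example document; the variable names (doc, unprefixed, drug_names) are made up for illustration, and local-name() is a standard XPath 1.0 function rather than anything specific to the XML package.

```r
library(XML)

# Trimmed-down copy of the example document from the question
xf <- '<?xml version="1.0" encoding="UTF-8"?>
<drugbank xmlns="http://www.drugbank.ca" version="4.3">
<drug type="biotech"><name>Lepirudin</name></drug>
<drug type="biotech"><name>Cetuximab</name></drug>
</drugbank>'
doc <- xmlParse(xf, asText = TRUE)

# Unprefixed names only match elements in no namespace, so nothing is found
unprefixed <- getNodeSet(doc, "//drug/name")
length(unprefixed)

# local-name() compares the element name without its namespace
drug_names <- xpathSApply(
  doc,
  "//*[local-name()='drug']/*[local-name()='name']",
  xmlValue
)
drug_names
# [1] "Lepirudin" "Cetuximab"
```

The trade-off is that this would also match drug and name elements from any other namespace, so the explicit prefix is the stricter choice when a document mixes vocabularies.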
MrFlick