# parse PubMed data 

library(XML) # xpath
library(rentrez) # entrez_fetch

pmids <- c("25506969", "25032371", "24983039", "24983034", "24983032", "24983031", "26386083",
           "26273372", "26066373", "25837167", "25466451", "25013473", "23733758")

# The IDs above are a mix of books and journal articles.
# ID 23733758 is a journal article and has no abstract.
data.pubmed <- entrez_fetch(db = "pubmed", id = pmids, rettype = "xml",
               parsed = TRUE)
abstracts <-  xpathApply(data.pubmed, "//Abstract", xmlValue)
names(abstracts) <- pmids

This works well if every record has an abstract. However, when a PMID (here 23733758) has no PubMed abstract (or the record is a book article or something else), that record is skipped, so the result is shorter than the ID vector and assigning the names fails with the error: 'names' attribute [5] must be the same length as the vector [4]
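The length mismatch can be reproduced offline. The following is a minimal sketch, assuming a made-up two-record document (`toy_xml`, with hypothetical PMIDs 111 and 222) in place of the real `entrez_fetch` result; only the first record carries an `<Abstract>`:

```r
library(XML)

# Hypothetical stand-in for the fetched PubMed XML:
# two records, but only the first one has an <Abstract>.
toy_xml <- '<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><PMID>111</PMID>
    <Article><ArticleTitle>First</ArticleTitle>
      <Abstract><AbstractText>Some text.</AbstractText></Abstract>
    </Article></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><PMID>222</PMID>
    <Article><ArticleTitle>Second</ArticleTitle></Article>
  </MedlineCitation></PubmedArticle>
</PubmedArticleSet>'

doc <- xmlParse(toy_xml, asText = TRUE)

# "//Abstract" only matches nodes that exist, so a record without an
# abstract contributes nothing to the result.
n_records   <- length(getNodeSet(doc, "//PubmedArticle"))
n_abstracts <- length(xpathApply(doc, "//Abstract", xmlValue))

n_records
n_abstracts
```

Because `n_abstracts` is smaller than `n_records`, a `names<-` call with one name per record cannot succeed.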

Q: How can I pass multiple paths/nodes so that I can extract abstracts for journal articles, books, or reviews? UPDATE: hrbrmstr's solution addresses the NA. But can xpathApply take multiple nodes, like c(//Abstract, //ReviewArticle, etc.)?

zx8754
user5249203
  • You could use `try()` or `tryCatch()`. – Rich Scriven Oct 05 '15 at 16:32
  • Hi Richard, I'm not sure I understood your solution. My goal is to get 5 abstracts as output if my input is 5 PMIDs. If a record has no abstract, it should still return a null value (4 abstracts and 1 NULL). That way, when I add the PMIDs as names, I can tell which PMID has no abstract. – user5249203 Oct 05 '15 at 17:06

1 Answer


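You have to attack it one element up: iterate over each article node and look for its abstract inside it, so that a record without an abstract yields NA instead of being dropped: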

abstracts <-  xpathApply(data.pubmed, "//PubmedArticle//Article", function(x) {
  val <- xpathSApply(x, "./Abstract", xmlValue)
  if (length(val)==0) val <- NA_character_
  val
})
names(abstracts) <- pmids

str(abstracts)
## List of 5
## $ 24019382: chr "Adenocarcinoma of the lung, a leading cause of cancer death, frequently displays mutational activation of the KRAS proto-oncoge"| __truncated__
## $ 23927882: chr "Mutations in components of the mitogen-activated protein kinase (MAPK) cascade may be a new candidate for target for lung cance"| __truncated__
## $ 23825589: chr "Aberrant activation of MAP kinase signaling pathway and loss of tumor suppressor LKB1 have been implicated in lung cancer devel"| __truncated__
## $ 23792568: chr "Sorafenib, the first agent developed to target BRAF mutant melanoma, is a multi-kinase inhibitor that was approved by the FDA f"| __truncated__
## $ 23733758: chr NA

Per your comment, here is an alternate way to do this:

str(xpathApply(data.pubmed, '//PubmedArticle//Article', function(x) {
  xmlValue(xmlChildren(x)$Abstract)
}))

## List of 5
##  $ : chr "Adenocarcinoma of the lung, a leading cause of cancer death, frequently displays mutational activation of the KRAS proto-oncoge"| __truncated__
##  $ : chr "Mutations in components of the mitogen-activated protein kinase (MAPK) cascade may be a new candidate for target for lung cance"| __truncated__
##  $ : chr "Aberrant activation of MAP kinase signaling pathway and loss of tumor suppressor LKB1 have been implicated in lung cancer devel"| __truncated__
##  $ : chr "Sorafenib, the first agent developed to target BRAF mutant melanoma, is a multi-kinase inhibitor that was approved by the FDA f"| __truncated__
##  $ : chr NA
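On the follow-up about passing several paths: XPath itself has a union operator, `|`, so a single expression can select more than one kind of record wrapper. The following is a rough sketch rather than a tested recipe for real PubMed output. It assumes a made-up document (`toy_xml`, hypothetical PMIDs) in which book records are wrapped in `PubmedBookArticle`/`BookDocument`; the helper `get_abstract` is also hypothetical. Names are read from each record's own `PMID` so that they do not depend on the order of the input IDs:

```r
library(XML)

# Hypothetical mixed result set: a journal article, a book record,
# and a journal article without an abstract.
toy_xml <- '<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><PMID>111</PMID>
    <Article><Abstract><AbstractText>Journal abstract.</AbstractText></Abstract></Article>
  </MedlineCitation></PubmedArticle>
  <PubmedBookArticle><BookDocument><PMID>222</PMID>
    <Abstract><AbstractText>Book abstract.</AbstractText></Abstract>
  </BookDocument></PubmedBookArticle>
  <PubmedArticle><MedlineCitation><PMID>333</PMID>
    <Article><ArticleTitle>No abstract here</ArticleTitle></Article>
  </MedlineCitation></PubmedArticle>
</PubmedArticleSet>'

doc <- xmlParse(toy_xml, asText = TRUE)

# "|" is the XPath union operator: select either kind of record wrapper.
records <- getNodeSet(doc, "//PubmedArticle | //PubmedBookArticle")

# Hypothetical helper: return the abstract text found anywhere below one
# record node, or NA when the record has none.
get_abstract <- function(node) {
  val <- xpathSApply(node, ".//Abstract", xmlValue)
  if (length(val) == 0) NA_character_ else paste(val, collapse = " ")
}

abstracts <- vapply(records, get_abstract, character(1))

# Name each result by the first PMID inside its own record.
names(abstracts) <- vapply(
  records,
  function(node) xpathSApply(node, ".//PMID", xmlValue)[1],
  character(1)
)

str(abstracts)
```

The result has one entry per record, with `NA` marking the record that has no abstract, so its length always matches the number of selected records.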
hrbrmstr
  • Hi hrbrmstr, thank you for your response. I am afraid that doing this will restrict the results to PubmedArticle only. In a batch search of PMIDs, some may be books, reviews, or other kinds of publications. – user5249203 Oct 05 '15 at 17:23
  • You can change the XPath for the wrapper then. (you can select multiple with `and` conditions) – hrbrmstr Oct 05 '15 at 17:35
  • It returns a list when I run it (I added the output to the example answer). – hrbrmstr Oct 07 '15 at 14:28