Extracting PubMed abstracts in R retrieves each abstract in multiple rows (more rows in abstracts than in PubMed IDs)

Question

I am trying to extract PubMed abstracts and their titles and place them in a data frame. With the help of Stack Overflow members, I was able to write the code below, which works. The issue now is that the number of rows in the abstract variable is higher than in pmidd or title, so I am unable to merge them correctly. Looking at the structure of the XML file, it appears that some abstracts have more than one node, which is why they are extracted into more than one row. Any suggestions on how to overcome this and get each abstract in one row, so that I can merge the variables?

Here is my code:



library(XML)
library(httr)
library(glue)
library(dplyr)

# Search PubMed and collect the matching PMIDs
query = 'asthma[mesh]+AND+eosinophils[mesh]+AND+2009[pdat]'
reqq = glue('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&RetMax=50&term={query}')
op = GET(reqq)
content(op)

df_op <- op %>% xml2::read_xml() %>% xml2::as_list()
pmids <- df_op$eSearchResult$IdList %>% unlist(use.names = FALSE)

# Fetch the full records for those PMIDs as XML
reqq1 = glue("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id={paste0(pmids, collapse = ',')}&rettype=abstract&retmode=xml")
op1 = GET(reqq1)
a = xmlParse(content(op1))

# Extract each field over the whole document
pmidd = as.data.frame(xpathSApply(a, '/PubmedArticleSet/PubmedArticle/MedlineCitation/PMID', xmlValue))
title = as.data.frame(xpathSApply(a, '/PubmedArticleSet/PubmedArticle/MedlineCitation/Article/ArticleTitle', xmlValue))
abstract = as.data.frame(xpathSApply(a, '/PubmedArticleSet/PubmedArticle/MedlineCitation/Article/Abstract/AbstractText', xmlValue))

# The row counts differ: there are more abstract rows than PMIDs
nrow(pmidd)
nrow(abstract)

There are at least 2 packages that facilitate access to PubMed. [easyPubMed](https://cran.r-project.org/web/packages/easyPubMed/index.html) and [pubmedR](https://cran.r-project.org/web/packages/pubmedR/index.html). It's probably easier to work with these. — Till, Sep 16 '21 at 04:31
@Till, thanks for the input. I am aware of these packages and I totally agree with you that they are much easier to work with. But for this particular task I am doing now, I need to work directly with the api. — Bahi8482, Sep 16 '21 at 04:36

Till · Accepted Answer · 2021-09-16T05:37:27.830

Some articles have the abstract split into several sections (Objective, Methods, ...), some have just a single entry, and some have no abstract at all. You'll have to handle all of these scenarios.

XML::xmlToList() can be used to extract a list from the XML data. We can then use purrr's map*() functions to flatten the data.

library(purrr)
b <- xmlToList(a)


res <- map_dfr(b, \(x) {
  abstract_l <- x$MedlineCitation$Article$Abstract
  if (is.null(abstract_l))
    abstract_l <- ""
  tibble(
    pmid = x$MedlineCitation$PMID$text,
    title = x$MedlineCitation$Article$ArticleTitle,
    abstract = ifelse(
      length(abstract_l) > 1,
      map_chr(abstract_l, \(y) y[[1]]) |> paste(collapse = "\n"),
      unlist(abstract_l)
    )
  )
})
res$abstract
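
For readers who find the list-based version hard to follow, the same idea can be sketched with XPath alone: loop over each PubmedArticle node and collapse whatever AbstractText nodes it contains into one string, so every article yields exactly one row. This is only a sketch under assumptions: the sample XML is hand-written to mimic the three scenarios above (it is not real efetch output), and the names sample_xml, first_or_na, and res_xpath are made up for illustration.

```r
library(XML)

# Hypothetical, hand-written sample mimicking the efetch layout:
# a sectioned abstract, a single-entry abstract, and no abstract.
sample_xml <- '<PubmedArticleSet>
<PubmedArticle><MedlineCitation>
<PMID Version="1">111</PMID>
<Article><ArticleTitle>Sectioned</ArticleTitle>
<Abstract>
<AbstractText Label="OBJECTIVE">Aim.</AbstractText>
<AbstractText Label="METHODS">Method.</AbstractText>
</Abstract></Article>
</MedlineCitation></PubmedArticle>
<PubmedArticle><MedlineCitation>
<PMID Version="1">222</PMID>
<Article><ArticleTitle>Single</ArticleTitle>
<Abstract><AbstractText>Single paragraph.</AbstractText></Abstract>
</Article>
</MedlineCitation></PubmedArticle>
<PubmedArticle><MedlineCitation>
<PMID Version="1">333</PMID>
<Article><ArticleTitle>None</ArticleTitle></Article>
</MedlineCitation></PubmedArticle>
</PubmedArticleSet>'

doc <- xmlParse(sample_xml, asText = TRUE)
articles <- getNodeSet(doc, "/PubmedArticleSet/PubmedArticle")

# Helper (hypothetical name): first match of a relative XPath, or NA
first_or_na <- function(node, path) {
  val <- xpathSApply(node, path, xmlValue)
  if (length(val) == 0) NA_character_ else val[[1]]
}

# One data frame row per article; abstract sections are joined with "\n"
rows <- lapply(articles, function(art) {
  parts <- xpathSApply(
    art, "./MedlineCitation/Article/Abstract/AbstractText", xmlValue
  )
  data.frame(
    pmid = first_or_na(art, "./MedlineCitation/PMID"),
    title = first_or_na(art, "./MedlineCitation/Article/ArticleTitle"),
    # unlist() turns the empty list returned for "no match" into NULL,
    # which paste() collapses to an empty string
    abstract = paste(unlist(parts), collapse = "\n"),
    stringsAsFactors = FALSE
  )
})
res_xpath <- do.call(rbind, rows)
```

With the real data, the parsed document `a` from the question would take the place of `doc`. Because the XPath queries are evaluated relative to each article node, the PMID, title, and abstract always stay aligned, which is what the three separate document-wide queries could not guarantee.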

This works really well, thank you. It is a bit complex for me (particularly with the new R pipes), but I am trying to understand the steps. — Bahi8482, Sep 19 '21 at 03:18
