I am trying to extract PubMed abstracts and their titles and place them in a data frame. With the help of Stack Overflow members, I was able to write the code below, which works. The issue now is that the abstract variable has more rows than pmidd or title, so I am unable to merge them correctly. Looking at the structure of the XML file, it appears that some abstracts have more than one AbstractText node, which is why they are extracted into more than one row. Any suggestions on how to overcome this and have each abstract in one row, so that I can merge the variables?
Here is my code:
library(XML)
library(httr)
library(glue)
library(dplyr)
####
query = 'asthma[mesh]+AND+eosinophils[mesh]+AND+2009[pdat]'
reqq = glue ('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&RetMax=50&term={query}')
op = GET(reqq)
content(op)
df_op <- op %>% xml2::read_xml() %>% xml2::as_list()
pmids <- df_op$eSearchResult$IdList %>% unlist(use.names = FALSE)
reqq1 = glue("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id={paste0(pmids, collapse = ',')}&rettype=abstract&retmode=xml")
op1 = GET(reqq1)
a = xmlParse(content(op1))
pmidd = as.data.frame(xpathSApply(a, '/PubmedArticleSet/PubmedArticle/MedlineCitation/PMID', xmlValue))
title = as.data.frame(xpathSApply(a, '/PubmedArticleSet/PubmedArticle/MedlineCitation/Article/ArticleTitle', xmlValue))
abstract = as.data.frame(xpathSApply(a, '/PubmedArticleSet/PubmedArticle/MedlineCitation/Article/Abstract/AbstractText', xmlValue))
nrow(pmidd)
nrow(abstract)
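
One direction I am considering, as a rough sketch only: loop over each PubmedArticle node and paste all of its AbstractText values into a single string, so every article gives exactly one row. The toy XML below is made up to stand in for the efetch response, and the helper name collapse_values is hypothetical; I am also assuming articles without an abstract should get NA.

```r
library(XML)

# Toy document standing in for the efetch response (invented for illustration).
xml_txt <- '<PubmedArticleSet>
  <PubmedArticle><MedlineCitation>
    <PMID>1</PMID>
    <Article><ArticleTitle>First</ArticleTitle>
      <Abstract>
        <AbstractText Label="BACKGROUND">Part A.</AbstractText>
        <AbstractText Label="RESULTS">Part B.</AbstractText>
      </Abstract></Article>
  </MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation>
    <PMID>2</PMID>
    <Article><ArticleTitle>Second</ArticleTitle></Article>
  </MedlineCitation></PubmedArticle>
</PubmedArticleSet>'

a <- xmlParse(xml_txt, asText = TRUE)

# One node per article, so every column below has the same length.
articles <- getNodeSet(a, "/PubmedArticleSet/PubmedArticle")

# Hypothetical helper: collect all matches of a relative XPath under one
# article node and collapse them into a single string (NA if none found).
collapse_values <- function(node, path) {
  vals <- unlist(xpathSApply(node, path, xmlValue))
  if (length(vals) == 0) NA_character_ else paste(vals, collapse = " ")
}

result <- data.frame(
  pmid     = vapply(articles, collapse_values, character(1),
                    path = "./MedlineCitation/PMID"),
  title    = vapply(articles, collapse_values, character(1),
                    path = "./MedlineCitation/Article/ArticleTitle"),
  abstract = vapply(articles, collapse_values, character(1),
                    path = "./MedlineCitation/Article/Abstract/AbstractText"),
  stringsAsFactors = FALSE
)

nrow(result)
```

With the real response, a from my code above would replace the toy document. Because each column is built per article, pmid, title and abstract stay aligned without a separate merge step.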