I am querying PubMED with a long list of PMIDs using R. Because entrez_fetch can only do a certain number at a time, I have broken down my ~2000 PMIDs into one list with several vectors (each about 500 in length). When I query PubMED, I am extracting information from XML files for each publication. What I would like to have in the end is something like this:
Original.PMID Publication.type
26956987 Journal.article
26956987 Meta.analysis
26956987 Multicenter.study
26402000 Journal.article
25404043 Journal.article
25404043 Meta.analysis
Each publication has a unique PMID but there may be several publication types associated with each PMID (as seen above). I can query the PMID number from the XML file, and I can get the publication types of each PMID. What I have problems with is repeating the PMID x number of times so that each PMID is associated with each of the publication type it has. I am able to do this if I don't have my data in a list with multiple sublists (e.g., if I have 14 batches, each as its own data frame) by getting the number of children nodes from the parent PublicationType node. But I can't seem to figure out how to do this for within a list.
My code so far is this:
library(rvest)
library(tidyverse)
library(stringr)
library(regexr)
library(rentrez)
library(XML)
pubmed<-my.data.frame
into.batches<-function(x,n) split(x,cut(seq_along(x),n,labels=FALSE))
batches<-into.batches(pubmed.fwd$PMID, 14)
headings<-lapply(1:14, function(x) {paste0("Batch",x)})
names(batches)<-headings
fwd<-sapply(batches, function(x) entrez_fetch(db="pubmed", id=x, rettype="xml", parsed=TRUE))
trial1<-lapply(fwd, function(x)
list(pub.type = xpathSApply(x, "//PublicationTypeList/PublicationType", xmlValue),
or.pmid = xpathSApply(x, "//ArticleId[@IdType='pubmed']", xmlValue)))
trial1 is what I am having problems with. This gives me a list where within each Batch, I have a vector for pub.type and a vector for or.pmid but they're different lengths.
I am trying to figure out how many children publication types there are for each publication, so I can repeat the PMID that many number of times. I am currently using the following code which does not do what I want:
trial1<-lapply(fwd, function(x)
list(childnodes = xpathSApply(xmlRoot(x), "count(.//PublicationTypeList/PublicationType)", xmlChildren)))
Unfortunately, this just tells me the total number of children nodes for each batch, not for each publication (or pmid).