I was following this post's code: https://quantixed.org/2021/04/04/ten-years-vs-the-spread-ii-calculating-publication-lag-times-in-r/ and was amazed at the ability to output received, accepted and published dates/gaps between them. Would there be a way to get any of the following:

- number of authors (could write a counter for separators on this one to be fair)
- first author affiliation
- last author affiliation
- number of citations per article
- degree of the first author

Or to see the full output of what is able to be pulled? What I tried so far:

In grabbing the first and last authors after the database printed all authors, this sufficed:

theData$authLast <- sapply(strsplit(theData$authors, "|", fixed=TRUE), tail, 1)
theData$authFirst <- sapply(strsplit(theData$authors, "|", fixed=TRUE), head, 1)

However, when trying to get author affiliations, the following gives me all affiliations:

authAffil <- lapply(records, xpathSApply, ".//Author/AffiliationInfo", xmlValue)
authAffil[sapply(authAffil, is.list)] <- NA
authAffil <- sapply(authAffil, paste, collapse = "|")

Any direction in how to get the first author, affiliation, last author, affiliation into four columns from the database or other metrics listed above would be helpful. Thank you!

Edit: tried to make a reprex, let me know if this counts as a minimal reproducible example. thank you for the suggestion Ric Villalba!

#load in packages
library(reprex)
library(devtools)
#> Loading required package: usethis
install_github("ropensci/rentrez")
#> Skipping install of 'rentrez' from a github remote, the SHA1 (a225f213) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(rentrez)
require(XML)
#> Loading required package: XML
require(ggplot2)
#> Loading required package: ggplot2
require(ggridges)
#> Loading required package: ggridges
require(gridExtra)
#> Loading required package: gridExtra
# search pubmed using a search term (use_history allows retrieval of all records)
pp <- entrez_search(db="pubmed", term="cell[ta] AND 2010 : 2021[pdat] AND (journal article[pt] NOT review[pt] NOT comment[pt]
                    NOT autobiography[pt] NOT biography[pt] NOT case reports[pt] NOT clinical trial[pt]
                    NOT historical article[pt] NOT comparative study[pt] NOT evaluation study[pt]
                    NOT introductory journal article[pt])", use_history = TRUE)
pp_rec <- entrez_fetch(db="pubmed", web_history=pp$web_history, rettype="xml", parsed=TRUE)
# save records as XML file
# create the output directory first; saveXML() fails if it does not exist
dir.create("Data", showWarnings = FALSE)
saveXML(pp_rec, file = "Data/records.xml")
# read back from the same path that was just written
filename <- "Data/records.xml"
## extract a data frame from XML file
## This is modified from christopherBelter's pubmedXML R code
extract_xml <- function(theFile) {
  library(XML)
  newData <- xmlParse(theFile)
  records <- getNodeSet(newData, "//PubmedArticle")
  pmid <- xpathSApply(newData,"//MedlineCitation/PMID", xmlValue)
  doi <- lapply(records, xpathSApply, ".//ELocationID[@EIdType = \"doi\"]", xmlValue)
  doi[sapply(doi, is.list)] <- NA
  doi <- unlist(doi)
  authLast <- lapply(records, xpathSApply, ".//Author/LastName", xmlValue)
  authLast[sapply(authLast, is.list)] <- NA
  authInit <- lapply(records, xpathSApply, ".//Author/Initials", xmlValue)
  authInit[sapply(authInit, is.list)] <- NA
  authors <- mapply(paste, authLast, authInit, collapse = "|")
  authAffil <- lapply(records, xpathSApply, ".//Author/AffiliationInfo", xmlValue)
  authAffil[sapply(authAffil, is.list)] <- NA
  authAffil <- sapply(authAffil, paste, collapse = "|")
  theDF <- data.frame(pmid, doi, authors,authAffil, stringsAsFactors = FALSE)
  
  return(theDF)
}
#extract into a dataframe
theData <- extract_xml(filename)
#show the author affiliations as bunched
print(theData$authAffil[1])
#> [1] "Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA. Electronic address: kjsiddle@broadinstitute.org.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA; Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA 02115, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Division of Infectious Diseases, Massachusetts General Hospital, Boston, MA 02114, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Faculty of Arts and Sciences, Harvard University, Cambridge, MA 02138, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA.|Department of Epidemiology, Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA 02115, USA; Center for Communicable Disease Dynamics, Department of Epidemiology, Harvard T. H. Chan School of Public Health, Harvard University, Boston, MA 02115, USA.|Department of Epidemiology, Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA 02115, USA; Center for Communicable Disease Dynamics, Department of Epidemiology, Harvard T. H. 
Chan School of Public Health, Harvard University, Boston, MA 02115, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA; Applied Epidemiology Fellowship, Council of State and Territorial Epidemiologists, Atlanta, GA 30345, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute 
of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Barnstable County Department of Health and the Environment, Barnstable, MA 02630, USA.|Barnstable County Department of Health and the Environment, Barnstable, MA 02630, USA.|Barnstable County Department of Health and the Environment, Barnstable, MA 02630, USA.|Barnstable County Department of Human Services, Barnstable, MA 02630, USA.|Community Tracing Collaborative, Commonwealth of Massachusetts, Boston, MA 02199, USA.|Community Tracing Collaborative, Commonwealth of Massachusetts, Boston, MA 02199, USA.|Community Tracing Collaborative, Commonwealth of Massachusetts, Boston, MA 02199, USA.|Department of Epidemiology, Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA 02115, USA; Center for Communicable Disease Dynamics, Department of Epidemiology, Harvard T. H. 
Chan School of Public Health, Harvard University, Boston, MA 02115, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA 02115, USA; Massachusetts Consortium for Pathogen Readiness, Boston, MA 02115, USA. Electronic address: bronwyn@broadinstitute.org.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA; Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA 02115, USA; Howard Hughes Medical Institute, Chevy Chase, MD 20815, USA; Massachusetts Consortium for Pathogen Readiness, Boston, MA 02115, USA."

Created on 2022-11-05 with reprex v2.0.2

quantixed
  • Hello, you must post a minimal reproducible example so that it can be answered without following links and downloading databases, i.e. post a sample of dput(authAffil) output or similar – Ric Nov 05 '22 at 15:49
  • @RicVillalba absolutely can do, just attempted to add in a reprex I'm hoping this suffices? if not let me know! – mickmars51 Nov 05 '22 at 16:29
  • Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. – Community Nov 16 '22 at 06:45

1 Answer

In the code that you posted, the extract_xml() function pulls information out of a large XML file retrieved using rentrez. Using the logic in your question, you can get four columns (first author, first author's affiliation, last author, last author's affiliation) like this:

theData$authFirst <- sapply(strsplit(theData$authors, "|", fixed=TRUE), head, 1)
theData$affilFirst <- sapply(strsplit(theData$authAffil, "|", fixed=TRUE), head, 1)
theData$authLast <- sapply(strsplit(theData$authors, "|", fixed=TRUE), tail, 1) 
theData$affilLast <- sapply(strsplit(theData$authAffil, "|", fixed=TRUE), tail, 1)

This will append four columns to the data frame called theData which was created in your reprex.
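
One caveat with splitting authAffil on "|": it only lines up with the authors column when every author has exactly one AffiliationInfo entry, and an author can have none or several. A possible alternative, offered only as a rough sketch, is to walk the Author nodes of each record so that each name stays attached to its own affiliations. The toy XML and the helper names author_info() and first_last() are made up for illustration; they are not part of the original post or of any package.

```r
library(XML)

# Toy stand-in for records.xml (invented data): three authors, the second
# with no affiliation and the third with two.
xml_txt <- '<PubmedArticleSet><PubmedArticle><MedlineCitation>
  <PMID>1</PMID>
  <Article><AuthorList>
    <Author><LastName>Alpha</LastName><Initials>A</Initials>
      <AffiliationInfo><Affiliation>Inst One</Affiliation></AffiliationInfo>
    </Author>
    <Author><LastName>Beta</LastName><Initials>B</Initials></Author>
    <Author><LastName>Gamma</LastName><Initials>GC</Initials>
      <AffiliationInfo><Affiliation>Inst Two</Affiliation></AffiliationInfo>
      <AffiliationInfo><Affiliation>Inst Three</Affiliation></AffiliationInfo>
    </Author>
  </AuthorList></Article>
</MedlineCitation></PubmedArticle></PubmedArticleSet>'

doc <- xmlParse(xml_txt, asText = TRUE)
records <- getNodeSet(doc, "//PubmedArticle")

# Hypothetical helper: name and affiliation(s) of one <Author> node.
author_info <- function(node) {
  last  <- unlist(xpathSApply(node, "./LastName", xmlValue))
  init  <- unlist(xpathSApply(node, "./Initials", xmlValue))
  affil <- unlist(xpathSApply(node, "./AffiliationInfo/Affiliation", xmlValue))
  name  <- paste(c(last, init), collapse = " ")
  affil <- paste(affil, collapse = "; ")  # several affiliations joined by "; "
  c(name  = if (nzchar(name)) name else NA_character_,
    affil = if (nzchar(affil)) affil else NA_character_)
}

# Hypothetical helper: one row per record with author count plus
# first/last author and their own affiliations.
first_last <- function(rec) {
  authors <- getNodeSet(rec, ".//AuthorList/Author")
  n <- length(authors)
  na2 <- c(name = NA_character_, affil = NA_character_)
  first <- if (n > 0) author_info(authors[[1]]) else na2
  last  <- if (n > 0) author_info(authors[[n]]) else na2
  data.frame(nAuthors = n,
             authFirst = first[["name"]], affilFirst = first[["affil"]],
             authLast = last[["name"]], affilLast = last[["affil"]],
             stringsAsFactors = FALSE)
}

res <- do.call(rbind, lapply(records, first_last))
print(res)
```

Because the rows come out in the same order as records, the result could probably be joined to theData with cbind(), although that is untested on the full data set. The nAuthors column also covers the "number of authors" item from the question.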

quantixed
  • thank you so much for responding! the reason why I stayed away from this approach is the last listed affiliation often does not correspond to the senior author of the paper (ex https://pubmed.ncbi.nlm.nih.gov/33423134/ where the last author is 7/13), but the first and last author points are well taken! if the data itself can be pulled from the df that would be spectacular, but I have a feeling I will have to change the output of the xml object into that data frame to fix my issues if you had any advice on that? I am currently trying to find ways to visualize all the outputs the xml can offer – mickmars51 Nov 06 '22 at 20:49
  • Ah OK, you didn't want last author/affiliation necessarily you wanted corresponding author/affiliation. This is tricky since a) there is no field to denote one author class from another and b) there can be multiple corresponding authors. I think the best approach would be to parse the affiliations (in the dataframe you already have) for "electronic address" or the @ symbol and use the affiliation of last occurrence to select the last author and affiliation, i.e. assume that that person is the senior author. – quantixed Nov 06 '22 at 21:28
  • my apologies for not being more clear! I think my confusion stemmed from this tutorial https://cran.r-project.org/web/packages/rentrez/vignettes/rentrez_tutorial.html where the "last_author" output might be as you are saying not the senior author. I will definitely try your approach to look for the email as a corresponding author, thank you so much! – mickmars51 Nov 06 '22 at 21:46
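
The suggestion in the comments above (look for "Electronic address" or an @ symbol and take the last occurrence) could be sketched roughly as follows. This is a heuristic under stated assumptions, not a definitive method: it assumes the authors and authAffil strings have the same number of "|"-separated entries, which does not hold when an author has no affiliation or several. The toy data frame and the function name last_corresponding() are invented for illustration.

```r
# Toy data in the same shape as theData (invented names and addresses).
toy <- data.frame(
  authors = c("Author A|Author B|Author C|Author D",
              "Author E|Author F"),
  authAffil = c(paste("Inst One. Electronic address: a@example.org.",
                      "Inst One.",
                      "Inst Two. Electronic address: c@example.org.",
                      "Inst Three.", sep = "|"),
                "Inst Four.|Inst Five."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: author and affiliation of the LAST entry that
# mentions an e-mail address; NA when nothing matches or the lists differ
# in length (in which case positions cannot be trusted).
last_corresponding <- function(authors, affils) {
  none <- c(author = NA_character_, affil = NA_character_)
  a <- strsplit(authors, "|", fixed = TRUE)[[1]]
  f <- strsplit(affils, "|", fixed = TRUE)[[1]]
  if (length(a) != length(f)) return(none)
  hit <- grep("electronic address|@", f, ignore.case = TRUE)
  if (length(hit) == 0) return(none)
  i <- max(hit)
  c(author = a[i], affil = f[i])
}

# Apply row by row; mapply() returns a 2 x n matrix, so transpose it.
res <- t(mapply(last_corresponding, toy$authors, toy$authAffil,
                USE.NAMES = FALSE))
toy$authCorr  <- res[, "author"]
toy$affilCorr <- res[, "affil"]
print(toy[, c("authCorr", "affilCorr")])
```

Returning NA when the counts differ is a deliberate choice: a wrong author paired with an affiliation seems worse than a missing value that can be inspected by hand.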