1

I have a list of articles obtained from an API, with my dataframe looking like this:

PMID        Year     Title                  Journal         Author 
33326729    2020     Avelumab Maintenance   PLoS biology    T., Powles
33326729    2020     Avelumab Maintenance   PLoS biology    B., Huang
33326729    2020     Avelumab Maintenance   PLoS biology    A., Di Pietro

Which I need to merge to this:

PMID        Year     Title                  Journal         Author-1         Author-2     Author-3
33326729    2020     Avelumab Maintenance   PLoS biology    T., Powles       B., Huang    A., Di Pietro

So basically, I need the articles to have the authors merged in a single row. I thought of sorting by id by doing as follows:

test <- setDT(PubMed_df)[, lapply(.SD, function(x) toString(na.omit(x))), by = "pmid"]

Outputs:
33326729    2020,2020,2020     Avelumab Maintenance,Avelumab Maintenance,Avelumab Maintenance   PLoS biology,PLoS biology,PLoS biology    T., Powles,B., Huang,A., Di Pietro

However, that generates the data with commas instead of separate columns. Does anyone know of a different function or how to adapt the setDT function in order to get my intended result? Thanks in advance

Edit: As requested the output of dput(head(PubMed_df)) :

structure(list(pmid = c("33326729", "33326729", "33326729", "33320856", 
"33320856", "33320856"), year = c("2020", "2020", "2020", "2021", 
"2021", "2021"), month = c("12", "12", "12", "01", "01", "01"
), day = c("21", "21", "21", "07", "07", "07"), lastname = c("Powles", 
"Huang", "di Pietro", "Reijns", "Thompson", "Acosta"), firstname = c("Thomas", 
"Bo", "Alessandra", "Martin A M", "Louise", "Juan Carlos"), address = c("St. Bartholomew's Hospital, London, United Kingdom thomas.powles1@nhs", 
"Pfizer, Groton, CT", "Pfizer, Milan, Italy", "MRC Human Genetics Unit, MRC Institute of Genetics and Molecular Medicine, The University of Edinburgh, Edinburgh, United Kingdom", 
"The South East of Scotland Clinical Genetic Service, Western General Hospital, NHS Lothian, Edinburgh, United Kingdom", 
"Cancer Research UK Edinburgh Centre, MRC Institute of Genetics and Molecular Medicine, The University of Edinburgh, Edinburgh, United Kingdom"
), journal = c("The New England journal of medicine", "The New England journal of medicine", 
"The New England journal of medicine", "PLoS biology", "PLoS biology", 
"PLoS biology"), title = c("Avelumab Maintenance for Urothelial Carcinoma. Reply.", 
"Avelumab Maintenance for Urothelial Carcinoma. Reply.", "Avelumab Maintenance for Urothelial Carcinoma. Reply.", 
"A sensitive and affordable multiplex RT-qPCR assay for SARS-CoV-2 detection.", 
"A sensitive and affordable multiplex RT-qPCR assay for SARS-CoV-2 detection.", 
"A sensitive and affordable multiplex RT-qPCR assay for SARS-CoV-2 detection."
), abstract = c(NA, NA, NA, "", "", ""), doi = c("10.1056/NEJMc2032018", 
"10.1056/NEJMc2032018", "10.1056/NEJMc2032018", "10.1371/journal.pbio.3001030", 
"10.1371/journal.pbio.3001030", "10.1371/journal.pbio.3001030"
), keywords = c("Antibodies, Monoclonal; Antibodies, Monoclonal, Humanized; Carcinoma, Transitional Cell; Humans; Urologic Neoplasms", 
"Antibodies, Monoclonal; Antibodies, Monoclonal, Humanized; Carcinoma, Transitional Cell; Humans; Urologic Neoplasms", 
"Antibodies, Monoclonal; Antibodies, Monoclonal, Humanized; Carcinoma, Transitional Cell; Humans; Urologic Neoplasms", 
"COVID-19; COVID-19 Testing; Humans; Multiplex Polymerase Chain Reaction; RNA, Viral; Reverse Transcriptase Polymerase Chain Reaction; SARS-CoV-2", 
"COVID-19; COVID-19 Testing; Humans; Multiplex Polymerase Chain Reaction; RNA, Viral; Reverse Transcriptase Polymerase Chain Reaction; SARS-CoV-2", 
"COVID-19; COVID-19 Testing; Humans; Multiplex Polymerase Chain Reaction; RNA, Viral; Reverse Transcriptase Polymerase Chain Reaction; SARS-CoV-2"
)), row.names = c(NA, 6L
), class = c("data.table", "data.frame"))

Edit 2: Highly detailed and specific request:

I need to get the data of which the head is presented above into a form in which every row has: PMID | Date of publication | Author 1 | Affiliation | Address | City | State (if US) | Country | Author 2 | Affiliation of Author 2 | Address | City | State (if US) | Country | And so on for each co-Author | Journal | Title | Abstract* | MH term

I will have to break up the addresses but thats something ill be focusing on later. For now my goal is to just get all the info for every author added to the right article without having 3 rows of the same article.

Edit 2 - To be used to get the answer from @r2evans to work in my case: The answer provided works if you use dcast as data.table::dcast!

Shandorius
  • 13
  • 1
  • 5
  • Do you want to [reshape data from long to wide in R](https://stackoverflow.com/questions/5890584/how-to-reshape-data-from-long-to-wide-format)? – Rui Barradas Jan 12 '21 at 16:03
  • @RuiBarradas I guess in a sense thats what I am attempting to do, I have also tried using the reshape2 function as explained [here](https://stackoverflow.com/questions/49139288/combining-multiple-rows-into-one-row-with-multiple-columns-of-data-r) however I cant get it to work. – Shandorius Jan 12 '21 at 16:07
  • It would be easier for us to test on your data if you provided the output from `dput(head(PubMed_df))`; it is unambiguous, and makes it much easier for us to work around the embedded spaces in your data that defeat a simple copy/`read.table` assignment in R. – r2evans Jan 12 '21 at 16:49
  • @r2evans Thanks for your help! I am looking at your proposed answer now. Although I havent attempted it yet (which i will do after writing this comment) it looks promising. I also added the data you requested in my question. I see one complication, which is that you specify Author and i require it to be for multiple vars (i simply chose author as its the most important and would make it more straight forward). But hopefully ill be able to figure that bit out myself. Thanks! – Shandorius Jan 12 '21 at 18:26
  • An important thing to note when using `dput` on `data.table` objects is that the `.internal.selfref` breaks our ability to copy and paste into our console (one of the two benefits/purposes of requesting `dput` output). I'm removing it now, please keep that in mind in your future questions. Thanks! – r2evans Jan 12 '21 at 18:28
  • Shandorius, I did all of that because that is precisely what you provided and requested. Please be specific (careful!) what you ask for. – r2evans Jan 12 '21 at 18:31
  • @r2evans although the answer gets close, it doesn't work or I am doing something wrong. Instead of having columns called Author-1 Author-2 etc I get the names of authors as columns and either a 0 or a 1 depending on whether or not they wrote it. This however results in a dataframe 5044 columns wide. Is there a workaround to get what I want? Or is it simply not possible? – Shandorius Jan 12 '21 at 18:32
  • @r2evans Sorry! I hadnt seen your comment before. And sorry if im being unclear. Its just very complicated to explain what I mean since im tackling it step by step. I hope the extra clarification helps. It also doesn't help that I only started learning R a few months ago so some terms I do not properly understand still. I for example don't understand your comment regarding my implementation on the dput command. – Shandorius Jan 12 '21 at 18:37
  • How does my edit do on your real data? – r2evans Jan 12 '21 at 18:37
  • @r2evans Your output looks like what Im trying to achieve! So thats already great. Im getting the error `Error in .subset2(x, i, exact = exact) : recursive indexing failed at level 2` And im indeed using pmid as identifier :) – Shandorius Jan 12 '21 at 18:47
  • Sorry, no clue on that error. – r2evans Jan 12 '21 at 18:57
  • 1
    @r2evans no worries! Already appreciate your help a ton! Ill try and figure out what that means and how to resolve it. You already got me a lot further than I could have gotten by myself! Thank you so much! – Shandorius Jan 12 '21 at 19:06

1 Answers1

0

This is mostly a dupe from Rui's comment, but it helps to add a help-column to get it (I'll use row here). Since you started using data.table, I'll stick with that.

Edited to work with the updated data. (I'm assuming that pmid uniquely defines the groups.)

library(data.table)
setDT(PubMed_df)
PubMed_df[, row := seq_len(.N), by = .(pmid)]

And in Über-wide format:

dcast(PubMed_df, pmid + year + month + day + journal + title + abstract + doi + keywords ~ row, value.var = c("lastname", "firstname", "address"))
       pmid   year  month    day                             journal                                   title abstract                          doi                                keywords lastname_1 lastname_2 lastname_3 firstname_1 firstname_2 firstname_3                               address_1                               address_2                               address_3
     <char> <char> <char> <char>                              <char>                                  <char>   <char>                       <char>                                  <char>     <char>     <char>     <char>      <char>      <char>      <char>                                  <char>                                  <char>                                  <char>
1: 33320856   2021     01     07                        PLoS biology A sensitive and affordable multiplex...          10.1371/journal.pbio.3001030 COVID-19; COVID-19 Testing; Humans; ...     Reijns   Thompson     Acosta  Martin A M      Louise Juan Carlos MRC Human Genetics Unit, MRC Institu... The South East of Scotland Clinical ... Cancer Research UK Edinburgh Centre,...
2: 33326729   2020     12     21 The New England journal of medicine Avelumab Maintenance for Urothelial ...     <NA>         10.1056/NEJMc2032018 Antibodies, Monoclonal; Antibodies, ...     Powles      Huang  di Pietro      Thomas          Bo  Alessandra St. Bartholomew's Hospital, London, ...                      Pfizer, Groton, CT                    Pfizer, Milan, Italy

Realize that when you have papers with fewer authors than the max number of authors in your dataset, they will have empty/NA columns. For instance, if I remove rows 5-6 and do the same,

PubMed_df <- PubMed_df[1:4,]
dcast(PubMed_df, pmid + year + month + day + journal + title + abstract + doi + keywords ~ row, value.var = c("lastname", "firstname", "address"))
#        pmid   year  month    day                             journal                                   title abstract                          doi                                keywords lastname_1 lastname_2 lastname_3 firstname_1 firstname_2 firstname_3                               address_1          address_2            address_3
#      <char> <char> <char> <char>                              <char>                                  <char>   <char>                       <char>                                  <char>     <char>     <char>     <char>      <char>      <char>      <char>                                  <char>             <char>               <char>
# 1: 33320856   2021     01     07                        PLoS biology A sensitive and affordable multiplex...          10.1371/journal.pbio.3001030 COVID-19; COVID-19 Testing; Humans; ...     Reijns       <NA>       <NA>  Martin A M        <NA>        <NA> MRC Human Genetics Unit, MRC Institu...               <NA>                 <NA>
# 2: 33326729   2020     12     21 The New England journal of medicine Avelumab Maintenance for Urothelial ...     <NA>         10.1056/NEJMc2032018 Antibodies, Monoclonal; Antibodies, ...     Powles      Huang  di Pietro      Thomas          Bo  Alessandra St. Bartholomew's Hospital, London, ... Pfizer, Groton, CT Pfizer, Milan, Italy
r2evans
  • 141,215
  • 6
  • 77
  • 149