2

This is a continuation of a past question I asked. Basically, I have a dataframe, df

         Beginning1 Protein2    Protein3    Protein4    Biomarker1
Pathway3    A         G           NA           NA           F
Pathway6    A         G           NA           NA           E
Pathway2    A         B           H            NA           F
Pathway5    A         B           H            NA           E
Pathway1    A         D           K            NA           F
Pathway7    A         B           C            D            F
Pathway4    A         B           C            D            E

And now I want to consolidate the rows to look like this:

dfnew 
         Beginning1 Protein2    Protein3    Protein4    Biomarker1
Pathway3    A         G           NA           NA           F, E
Pathway2    A         B           H            NA           F, E
Pathway7    A         D           K            NA           F    
Pathway1    A         B           C            D            F, E

I've seen a lot of people consolidate identical rows in dataframes using aggregate, but I can't seem to get that function to work on non-numerical values. The closest question I have seen solved it like this: df1 <- aggregate(df[7], df[-7], unique) and can be found here: Combining duplicated rows in R and adding new column containing IDs of duplicates.

Also, not every pathway has a matching pair, as can be seen in pathway 1.

Thank you so much for your help!

Thomas
  • 43,637
  • 12
  • 109
  • 140
Taylor Maurer
  • 239
  • 1
  • 3
  • 9
  • Could you explain in words how you are combining them? Are you just taking every pair of rows and then pasting together Biomarker1? – josliber Jun 28 '17 at 17:48
  • Also, it appears you have a column named `1`, which is not a valid variable name in a data frame. – josliber Jun 28 '17 at 17:48
  • No, its not just every pair of rows. It is only the rows that are identical from Protein 2 to Protein4. I just added in another hypothetical pathway, to show that not all pathways have a matching pair. – Taylor Maurer Jun 28 '17 at 17:51
  • OK, please edit the question to make that clear. – josliber Jun 28 '17 at 17:52

2 Answers2

2

The following solution using the ‹dplyr› and ‹tidyr› packages should do what you want:

df %>%
    group_by(Protein2, Protein3, Protein4) %>%
    nest() %>%
    mutate(Biomarker1 = lapply(data, `[[`, 'Biomarker1'),
           Biomarker1 = unlist(lapply(Biomarker1, paste, collapse = ', '))) %>%
    ungroup() %>%
    # Restore the “Beginning1” column is a bit of work, unfortunately.
    mutate(Beginning1 = lapply(data, `[[`, 'Beginning1'),
           Beginning1 = unlist(lapply(Beginning1, `[[`, 1))) %>%
    select(-data)
Konrad Rudolph
  • 530,221
  • 131
  • 937
  • 1,214
  • is the nest function in a specific package? – Taylor Maurer Jun 28 '17 at 18:02
  • @TaylorMaurer Yes, as mentioned in my answer. Though see update. I made a mistake. – Konrad Rudolph Jun 28 '17 at 18:05
  • nest() appears to be in the package tidyr, but when ever I try to run the function, I get this error: Error in function_list[[i]](value) : could not find function "nest" – Taylor Maurer Jun 28 '17 at 18:09
  • I am having issues using this with a larger dataframe that contains 14 proteins (and matches identical pathways from Protein2:Protein14). Could you explain your code a little bit so I can modify it for this new dataset? – Taylor Maurer Jun 28 '17 at 18:30
  • 1
    @TaylorMaurer To modify this code for 14 proteins, you need to modify the `group_by` call and replace it with `group_by_(paste0('Protein', 2 : 14))`. However, at this point you should reconsider your data format: instead of having 14 protein columns you should probably transform your table into [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html), with one single column for the protein number and one for its amino acid (?). This will lead to a complete rewrite of the code though. – Konrad Rudolph Jun 29 '17 at 10:10
0

This is a dplyr solution which should yield the expected result.

library(dplyr)

df <- df %>%
      group_by(Beginning1, Protein2, Protein3, Protein4) %>%
      summarise(Biomarker1 = paste(Biomarker1, collapse = ", "))
Matt Jewett
  • 3,249
  • 1
  • 14
  • 21
  • When I try to run either your code or the code by @KonradRudolph on a much large dataset, I keep getting this error: Error in `[[<-.data.frame`(`*tmp*`, key_col, value = list(list(Beginning1 = c(1L, : replacement has 7 rows, data has 97 – Taylor Maurer Jun 28 '17 at 20:53
  • Do either of you know what this means or how I can fix this error? – Taylor Maurer Jun 28 '17 at 20:54