Consolidating duplicate rows in a dataframe

Question

This is a continuation of a past question I asked. Basically, I have a dataframe, df

         Beginning1 Protein2    Protein3    Protein4    Biomarker1
Pathway3    A         G           NA           NA           F
Pathway6    A         G           NA           NA           E
Pathway2    A         B           H            NA           F
Pathway5    A         B           H            NA           E
Pathway1    A         D           K            NA           F
Pathway7    A         B           C            D            F
Pathway4    A         B           C            D            E

And now I want to consolidate the rows to look like this:

dfnew 
         Beginning1 Protein2    Protein3    Protein4    Biomarker1
Pathway3    A         G           NA           NA           F, E
Pathway2    A         B           H            NA           F, E
Pathway7    A         D           K            NA           F    
Pathway1    A         B           C            D            F, E

I've seen a lot of people consolidate identical rows in dataframes using aggregate, but I can't seem to get that function to work on non-numerical values. The closest question I have seen solved it like this: df1 <- aggregate(df[7], df[-7], unique) and can be found here: Combining duplicated rows in R and adding new column containing IDs of duplicates.

Also, not every pathway has a matching pair, as can be seen in pathway 1.

Thank you so much for your help!

Could you explain in words how you are combining them? Are you just taking every pair of rows and then pasting together Biomarker1? — josliber, Jun 28 '17 at 17:48
Also, it appears you have a column named `1`, which is not a valid variable name in a data frame. — josliber, Jun 28 '17 at 17:48
No, its not just every pair of rows. It is only the rows that are identical from Protein 2 to Protein4. I just added in another hypothetical pathway, to show that not all pathways have a matching pair. — Taylor Maurer, Jun 28 '17 at 17:51

Konrad Rudolph · Answer 1 · 2017-06-28T18:10:45.627

2

The following solution using the ‹dplyr› and ‹tidyr› packages should do what you want:

df %>%
    group_by(Protein2, Protein3, Protein4) %>%
    nest() %>%
    mutate(Biomarker1 = lapply(data, `[[`, 'Biomarker1'),
           Biomarker1 = unlist(lapply(Biomarker1, paste, collapse = ', '))) %>%
    ungroup() %>%
    # Restore the “Beginning1” column is a bit of work, unfortunately.
    mutate(Beginning1 = lapply(data, `[[`, 'Beginning1'),
           Beginning1 = unlist(lapply(Beginning1, `[[`, 1))) %>%
    select(-data)

edited Jun 28 '17 at 18:10

answered Jun 28 '17 at 17:58

Konrad Rudolph

530,221
131
937
1,214

is the nest function in a specific package? – Taylor Maurer Jun 28 '17 at 18:02
@TaylorMaurer Yes, as mentioned in my answer. Though see update. I made a mistake. – Konrad Rudolph Jun 28 '17 at 18:05
nest() appears to be in the package tidyr, but when ever I try to run the function, I get this error: Error in function_list[[i]](value) : could not find function "nest" – Taylor Maurer Jun 28 '17 at 18:09
I am having issues using this with a larger dataframe that contains 14 proteins (and matches identical pathways from Protein2:Protein14). Could you explain your code a little bit so I can modify it for this new dataset? – Taylor Maurer Jun 28 '17 at 18:30
1

@TaylorMaurer To modify this code for 14 proteins, you need to modify the `group_by` call and replace it with `group_by_(paste0('Protein', 2 : 14))`. However, at this point you should reconsider your data format: instead of having 14 protein columns you should probably transform your table into [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html), with one single column for the protein number and one for its amino acid (?). This will lead to a complete rewrite of the code though. – Konrad Rudolph Jun 29 '17 at 10:10

score 0 · Answer 2 · answered Jun 28 '17 at 19:27

0

This is a dplyr solution which should yield the expected result.

library(dplyr)

df <- df %>%
      group_by(Beginning1, Protein2, Protein3, Protein4) %>%
      summarise(Biomarker1 = paste(Biomarker1, collapse = ", "))

answered Jun 28 '17 at 19:27

Matt Jewett

3,249
1
14
21

When I try to run either your code or the code by @KonradRudolph on a much large dataset, I keep getting this error: Error in `[[<-.data.frame`(`*tmp*`, key_col, value = list(list(Beginning1 = c(1L, : replacement has 7 rows, data has 97 – Taylor Maurer Jun 28 '17 at 20:53
Do either of you know what this means or how I can fix this error? – Taylor Maurer Jun 28 '17 at 20:54

Consolidating duplicate rows in a dataframe

2 Answers2

Linked