6

Basically, I have a dataframe, df

                  Beginning1 Protein2    Protein3    Protein4    Biomarker1
      Pathway3    A         G           NA           NA           F
      Pathway8    Z         G           NA           NA           E
      Pathway9    A         G           Z            H            F
      Pathway6    Y         G           Z            H            E
      Pathway2    A         G           D            NA           F
      Pathway5    Q         G           D            NA           E
      Pathway1    A         D           K            NA           F
      Pathway7    A         B           C            D            F
      Pathway4    V         B           C            D            E

And I want to combine the dataframe so that those rows when are identical from "Protein2" to "Protein4" are condense, giving the following:

            Beginning1 Protein2     Protein3     Protein4     Biomarker1
Pathway3    A,Z         G           NA           NA           F,E
Pathway9    A,Y         G           Z            H            F,E
Pathway2    A,Q         G           D            NA           F,E
Pathway1    A           D           K            NA           F
Pathway7    A,V         B           C            D            F,E

This is very similar to a question that I asked before (Consolidating duplicate rows in a dataframe), however the difference is that I am also consolidating the "Beginning1" row.

So far, I have tried:

library(dat.table)
dat<-data.table(df)

Total_collapse <- dat[, .(
Biomarker1 = paste0(Biomarker1, collapse = ", ")),
by = .(Beginning1, Protein1, Protein2, Protein3)]

Total_collapse <- dat[, .(
Beginning1 = paste0(Beginning1, collapse = ", ")),
by = .(Protein1, Protein2, Protein3)]

which gives the output:

            Beginning1  Protein2    Protein3      Protein4      Biomarker1
Pathway3    G           NA           NA           F,E
Pathway9    G           Z            H            F,E
Pathway2    G           D            NA           F,E
Pathway1    D           K            NA           F
Pathway7    B           C            D            F,E

Does anyone know how to fix this problem? I have also tried duplicating the solution from Collapse / concatenate / aggregate a column to a single comma separated string within each group, but have had no success.

I am sorry if it is a simple error- I am pretty new to R.

Frank
  • 66,179
  • 8
  • 96
  • 180
Taylor Maurer
  • 239
  • 1
  • 3
  • 9
  • Thank you for the edits. I am still struggling with how to make dataframes appear as they should be on stackoverflow – Taylor Maurer Jul 25 '17 at 18:03
  • Np. You may need to give a reproducible example (like the first code line in your link, `data <-`) since the difficulty could be some subtle difference in the data itself (like how strings are stored). – Frank Jul 25 '17 at 18:05

3 Answers3

7

Here's a possible solution using dplyr

df %>% group_by_at(vars(Protein2:Protein4)) %>%
  summarize_all(paste, collapse=",")
MrFlick
  • 195,160
  • 17
  • 277
  • 295
2

Using data.table you can use .SD to refer to all columns not specified in the by argument. Then we can use lapply to accomplish the paste() with collapse.

library(data.table)
dt <- read.table(text = "Beginning1 Protein2    Protein3    Biomarker1
                  A         G           NA           NA           F
                  Z         G           NA           NA           E
                  A         G           Z            H            F
                  Y         G           Z            H            E
                  A         G           D            NA           F
                  Q         G           D            NA           E
                  A         D           K            NA           F
                  A         B           C            D            F
                  V         B           C            D            E",header = T)
dt <- data.table(dt)
dt[,lapply(.SD, function(col) paste(col, collapse=", ")), 
    by=.(Protein2, Protein3, Protein4)]

Output

   Protein2 Protein3 Protein4 Beginning1 Biomarker1
1:        G       NA       NA       A, Z       F, E
2:        G        Z        H       A, Y       F, E
3:        G        D       NA       A, Q       F, E
4:        D        K       NA          A          F
5:        B        C        D       A, V       F, E
Adam Spannbauer
  • 2,707
  • 1
  • 17
  • 27
1

We can use aggregate from base R

r1 <- aggregate(cbind(Beginning1, Biomarker1)~., replace(df,is.na(df), "NA"), FUN = toString)
r1
#    Protein2 Protein3 Protein4 Beginning1 Biomarker1
#1        B        C        D       A, V       F, E
#2        G        Z        H       A, Y       F, E
#3        G        D       NA       A, Q       F, E
#4        D        K       NA          A          F
#5        G       NA       NA       A, Z       F, E
r1[r1=="NA"] <- NA
akrun
  • 874,273
  • 37
  • 540
  • 662