1

Basically, I have a dataframe, df

         Beginning1 Protein2    Protein3    Protein4    Biomarker1
Pathway3    A         G           NA           NA           F
Pathway8    A         G           NA           NA           E
Pathway9    A         G           Z            H            F
Pathway6    A         G           Z            H            E
Pathway2    A         G           D            NA           F
Pathway5    A         G           D            NA           E
Pathway1    A         D           K            NA           F
Pathway7    A         B           C            D            F
Pathway4    A         B           C            D            E

And now I want to consolidate the rows to look like this:

newdf
      Beginning1    Protein2    Protein3    Protein4    Biomarker1
Pathway3    A         G           NA           NA           F, E
Pathway9    A         G           Z            H            F, E
Pathway2    A         G           D            NA           F, E
Pathway1    A         D           K            NA           F
Pathway4    A         B           C            D            F, E

This is a continuation of a past question I asked (Consolidating duplicate rows in a dataframe). That works for this dataset, but for my much larger dataset it does not seem to combine the values. For example, the first few lines of output (after I modified the code given by @Matt Jewett or used the explanations provided in Concatenate strings by group with dplyr):

          Beginning1    Protein2    Protein3    Protein4    Biomarker1
Pathway1    Smoothened    Gl-1                              Osteopontin
Pathway2    Smoothened    Gl-1      BMP2                    Osteopontin
Pathway3    Smoothened    Gl-1      BMP2                    DLX5
Pathway4    Smoothened    Gl-1      BMP2                    Osteopontin

As you can see, there are several problems. First, the Biomarker1 column doesn't seem to be aggregating. And secondly, there are repeats of several rows. I have hit a wall in terms of solutions, so any solutions you guys can think up would be much appreciated!

Thank you so much for your help!

Taylor Maurer
  • 239
  • 1
  • 3
  • 9

1 Answers1

1

Simple enough using data.table

library(data.table)

dat <- fread("Pathway Beginning1 Protein2    Protein3    Protein4    Biomarker1
             Pathway3    A         G           NA           NA           F
             Pathway8    A         G           NA           NA           E
             Pathway9    A         G           Z            H            F
             Pathway6    A         G           Z            H            E
             Pathway2    A         G           D            NA           F
             Pathway5    A         G           D            NA           E
             Pathway1    A         D           K            NA           F
             Pathway7    A         B           C            D            F
             Pathway4    A         B           C            D            E")

dat_collapse <- dat[, .(Pathway = Pathway[1],
                        Biomarker1 = paste0(Biomarker1, collapse = ", ")),
                    by = .(Beginning1, Protein2, Protein3, Protein4)]

setcolorder(dat_collapse, names(dat))
dat_collapse 

Results in:

    Pathway Beginning1 Protein2 Protein3 Protein4 Biomarker1
1: Pathway3          A        G       NA       NA       F, E
2: Pathway9          A        G        Z        H       F, E
3: Pathway2          A        G        D       NA       F, E
4: Pathway1          A        D        K       NA          F
5: Pathway7          A        B        C        D       F, E
Eric Watt
  • 3,180
  • 9
  • 21
  • Sorry for the late response, but this doesn't seem to be working for me. I keep getting this error: unused argument (by = .(Beginning1, Protein2, Protein3, Protein4)) – Taylor Maurer Jul 13 '17 at 17:20
  • Are you using a `data.table` object? The function `fread` will create one; if you already have a `data.frame` named `df` you can make it a data.table with `setDT(df)` or you can make a copy with `dat <- data.table(df)`. – Eric Watt Jul 13 '17 at 17:25
  • it is not letting me creat one using fread. When I try fread(df) where df is the name of my dataframe I get this error: Error in fread(all) : 'input' must be a single character string containing a file name, a command, full path to a file, a URL starting 'http[s]://', 'ftp[s]://' or 'file://', or the input data itself. – Taylor Maurer Jul 13 '17 at 17:29
  • Is there another function that could be used to make my data.frame a data.table object? – Taylor Maurer Jul 13 '17 at 17:31
  • Ah, don't use `fread` for that, it's used to convert text or system files to a data.table. Use `setDT` or `data.table` to convert your `data.frame`. – Eric Watt Jul 13 '17 at 17:36
  • `dat <- data.table(df)` where `df` is the name of your `data.frame`. – Eric Watt Jul 13 '17 at 17:36
  • Thank you, but when I use dat_collapse <- dat[, .(Pathway = Pathway[1], + Biomarker1 = paste0(Biomarker1, collapse = ", ")), + by = .(Beginning1, Protein2, Protein3)] then I get this error: object 'Pathway' not found. I'm sorry if I am not understanding something really simple- I am pretty new to R. – Taylor Maurer Jul 13 '17 at 17:38
  • Update: it works if I remove the function Pathway=Pathway[1]. What was your reasoning for adding that to the code? – Taylor Maurer Jul 13 '17 at 17:41
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/149160/discussion-between-eric-watt-and-taylor-maurer). – Eric Watt Jul 13 '17 at 17:51