0

Hi I need to concatenate strings by groups for multiple columns. I realise that versions of this question has been asked several times (see Aggregating by unique identifier and concatenating related values into a string), but they usually involve concatenating values of a single column.

My dataset is something like:

Sample  group   Gene1   Gene2   Gene3
A       1       a       NA      NA
A       2       b       NA      NA
B       1       NA      c       NA
C       1       a       NA      d
C       2       b       NA      e
C       3       c       NA      NA

I want to get it into a format where each samples takes only 1 row (the group column is optional):

Sample  group   Gene1   Gene2   Gene3
A       1,2     a,b     NA      NA
B       1       NA      c       NA
C       1,2,3   a,b,c   NA      d,e

Since the number of genes can go up to the thousands, I can't simply specify the columns that I wish to concatenate. I know aggregate or dplyr can be used to get the groups but I can't figure out how to do it for multiple columns.

Thanks in advance!

Edit

As my dataset is very large containing thousands of genes, I realised dplyr is too slow. I've been experimenting with data.table and the following code can also get what I want:

setDT(df)[, lapply(.SD, function(x) paste(na.omit(x), collapse = ",")), by = Sample]

The output is now:

   Sample group Gene1 Gene2 Gene3
1:      A   1,2   a,b            
2:      B     1           c      
3:      C 1,2,3 a,b,c         d,e

Thanks for all your help!

Community
  • 1
  • 1
phusion
  • 97
  • 1
  • 9

2 Answers2

2

For these purposes, there are the summarise_all, summarise_at, and summarise_if functions. Using summarise_all:

df %>%
  group_by(Sample) %>%
  summarise_all(funs(paste(na.omit(.), collapse = ",")))
# A tibble: 3 × 5
  Sample group Gene1 Gene2 Gene3
   <chr> <chr> <chr> <chr> <chr>
1      A   1,2   a,b            
2      B     1           c      
3      C 1,2,3 a,b,c         d,e

Update: In current versions of dplyr, it is encouraged to combine summarise with across, e.g. like this:

df %>%
  group_by(Sample) %>%
  summarise(across(everything(), \(x) paste(na.omit(x), collapse = ",")))
Axeman
  • 32,068
  • 8
  • 81
  • 94
  • Thank you both for your answers! However my dataset is very large and dplyr takes a long time to run. I realised that my question can also be solved using data.table: http://stackoverflow.com/a/26981425/6782374. However I do not know how to omit the NAs. – phusion Feb 17 '17 at 09:09
  • I found a way already and edited my question to reflect the code I eventually used. Thanks for your help! – phusion Feb 20 '17 at 01:48
1

using dplyr, you can try:

dft %>%
  group_by(Sample) %>%
  summarise_each(funs( toString(unique(.))))

which gives:

# A tibble: 3 × 5
  Sample   group   Gene1 Gene2    Gene3
   <chr>   <chr>   <chr> <chr>    <chr>
1      A    1, 2    a, b    NA       NA
2      B       1      NA     c       NA
3      C 1, 2, 3 a, b, c    NA d, e, NA

Edit: @Axeman had the right idea with using na.omit(.) to get rid of null values

Aramis7d
  • 2,444
  • 19
  • 25