Concatenate duplicate dataframe values in R

Question

I have a very long dataframe with nearly 56 columns. One column has many different values, while the rest of the columns change only with the first column, ID. Here's an example:

ID  chrom   left    right   ref_seq var_type    zygosity    transcript_name
0   chr1    1590327 1590328 a       SNP         Hom         NM_033486
0   chr1    1590327 1590328 a       SNP         Hom         NM_033487
0   chr1    1590327 1590328 a       SNP         Hom         NM_033488
0   chr1    1590327 1590328 a       SNP         Hom         NM_033489
0   chr1    1590327 1590328 a       SNP         Hom         NM_033492
0   chr1    1590327 1590328 a       SNP         Hom         NM_033493
1   chr1    1590526 1590527 g       SNP         Hom         NM_033486
1   chr1    1590526 1590527 g       SNP         Hom         NM_033487
1   chr1    1590526 1590527 g       SNP         Hom         NM_033488
1   chr1    1590526 1590527 g       SNP         Hom         NM_033489
1   chr1    1590526 1590527 g       SNP         Hom         NM_033492

The desired result would be to concatenate the values from the duplicated rows into a comma-separated string while keeping each ID only once, like this:

ID  chrom   left    right   ref_seq var_type    zygosity    transcript_name
0   chr1    1590327 1590328 a       SNP         Hom         NM_033486,NM_033487,NM_033488,NM_033489,NM_033492,NM_033493
1   chr1    1590526 1590527 g       SNP         Hom         NM_033486,NM_033487,NM_033488,NM_033489,NM_033492

I've searched for similar questions, but the solutions I found haven't worked so far; instead they return a zero-row dataframe.

Why do you get all 0? Can you show your script that doesn't work? — Sotos, Jul 08 '16 at 13:54
The following code worked for me, assuming your working data frame is the same as provided. `df2 <- aggregate(df[,8], df[,-8], FUN = function(X) paste(unique(X), collapse=", "))` — Dave Gruenewald, Jul 08 '16 at 14:11

user2100721 · Answer 1 · 2016-07-08T14:19:42.387

8

Another solution using base R

aggregate(transcript_name ~ ., data = df, FUN = paste, collapse = ",")

Thanks to @Sotos & @LyzandeR for collapse
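As a self-contained sketch of the formula interface used above: the `.` on the right-hand side means "group by every other column", and extra arguments such as `collapse` are passed on to `FUN`. The small `df` below is a trimmed copy of the question's example, rebuilt here only so the snippet runs on its own; `res` and `res2` are illustrative names.

```r
# Trimmed copy of the question's example data (illustrative only)
df <- data.frame(
  ID = c(0, 0, 0, 1, 1),
  chrom = "chr1",
  left = c(1590327, 1590327, 1590327, 1590526, 1590526),
  right = c(1590328, 1590328, 1590328, 1590527, 1590527),
  ref_seq = c("a", "a", "a", "g", "g"),
  var_type = "SNP",
  zygosity = "Hom",
  transcript_name = c("NM_033486", "NM_033487", "NM_033488",
                      "NM_033486", "NM_033487"),
  stringsAsFactors = FALSE
)

# Group by all columns except transcript_name and paste the values together;
# collapse = "," is forwarded to paste()
res <- aggregate(transcript_name ~ ., data = df, FUN = paste, collapse = ",")

# toString(x) is shorthand for paste(x, collapse = ", ")
res2 <- aggregate(transcript_name ~ ., data = df, FUN = toString)

print(res)
print(res2)
```

Without `collapse`, `paste` returns one string per input value, so each group does not reduce to a single string; comparing the two results with `str()` shows the difference.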

edited Jul 08 '16 at 14:19

answered Jul 08 '16 at 14:03


You also need to `collapse` – Sotos Jul 08 '16 at 14:04
@user2100721 use `str(aggregate(data=df,transcript_name~.,FUN=paste))` and you'll see the difference – LyzandeR Jul 08 '16 at 14:10
or better yet, save both versions `df1 <- aggregate(...paste)` and `df2 <- aggregate(...paste, collapse = ',')` and `View` the results – Sotos Jul 08 '16 at 14:12
You could just use `toString` instead – talat Jul 08 '16 at 14:17
@LyzandeR & Sotos Ok. I got that. Thanks. – user2100721 Jul 08 '16 at 14:19
I get the following message "Error in aggregate.data.frame(mf[1L], mf[-1L], FUN = FUN, ...) : no rows to aggregate" – civy Jul 08 '16 at 14:54
This is weird. Does anyone have any idea? – user2100721 Jul 08 '16 at 16:16
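
One possible cause of the "no rows to aggregate" error, offered as a guess since the full 56-column data is not shown: the formula method of `aggregate` defaults to `na.action = na.omit`, so any row with an `NA` in any column used by the formula is dropped before grouping. With `~ .` and many columns, that can leave no rows at all. The sketch below tries to reproduce this with a hypothetical all-`NA` column called `extra`, then works around it by aggregating on `ID` alone and joining the other columns back. It assumes, as the question states, that the other columns are constant within each `ID`.

```r
# Small illustrative data set; `extra` is a hypothetical column that is all NA
df <- data.frame(
  ID = c(0, 0, 1, 1),
  chrom = "chr1",
  transcript_name = c("NM_033486", "NM_033487", "NM_033486", "NM_033487"),
  extra = NA,
  stringsAsFactors = FALSE
)

# With `~ .`, every column enters the model frame, so na.omit discards
# each row that has an NA anywhere; capture the error message if one occurs
msg <- tryCatch(
  {
    aggregate(transcript_name ~ ., data = df, FUN = toString)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Workaround sketch: only ID and transcript_name enter the formula,
# so NAs in the other columns no longer matter
collapsed <- aggregate(transcript_name ~ ID, data = df, FUN = toString)

# One row per ID for the remaining columns, then join the collapsed strings
others <- unique(df[setdiff(names(df), "transcript_name")])
result <- merge(others, collapsed, by = "ID")
print(result)
```

Running `colSums(is.na(df))` on the real data would show whether missing values are actually present before trying this.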

score 4 · Accepted Answer · edited Jul 08 '16 at 14:53

One way with data.table:

library(data.table)
#setDT converts the data.frame into a data.table by reference (no copy is made)
#'by' lists the grouping columns; transcript_name is collapsed within each group
setDT(df)[, list(transcript_name = paste(transcript_name, collapse = ', ')),
          by = c('ID', 'chrom', 'left', 'right', 'ref_seq', 'var_type', 'zygosity')]
#   ID chrom    left   right ref_seq var_type zygosity                                                  transcript_name
#1:  0  chr1 1590327 1590328       a      SNP      Hom NM_033486, NM_033487, NM_033488, NM_033489, NM_033492, NM_033493
#2:  1  chr1 1590526 1590527       g      SNP      Hom            NM_033486, NM_033487, NM_033488, NM_033489, NM_033492
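
With nearly 56 columns, typing every grouping column by hand gets tedious. A minimal sketch of the same data.table idea, assuming every column other than `transcript_name` should act as a grouping column, builds the `by` vector with `setdiff`. The names `dt`, `group_cols` and `res` are illustrative, and `unique()` is an optional extra that drops repeated transcript names within a group.

```r
library(data.table)

# Trimmed copy of the question's example data (illustrative only)
df <- data.frame(
  ID = c(0, 0, 0, 1, 1),
  chrom = "chr1",
  left = c(1590327, 1590327, 1590327, 1590526, 1590526),
  right = c(1590328, 1590328, 1590328, 1590527, 1590527),
  ref_seq = c("a", "a", "a", "g", "g"),
  var_type = "SNP",
  zygosity = "Hom",
  transcript_name = c("NM_033486", "NM_033487", "NM_033487",
                      "NM_033486", "NM_033487"),
  stringsAsFactors = FALSE
)

# as.data.table() makes a copy, so the original df is left unchanged
dt <- as.data.table(df)

# Every column except the one being collapsed becomes a grouping column
group_cols <- setdiff(names(dt), "transcript_name")

# Collapse the distinct transcript names within each group
res <- dt[, .(transcript_name = toString(unique(transcript_name))),
          by = group_cols]
print(res)
```

data.table also treats `NA` as an ordinary group value, so rows with missing values in the grouping columns are not silently dropped.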

Data

df <- read.table(header = TRUE, text = '
ID  chrom   left    right   ref_seq var_type    zygosity    transcript_name
0   chr1    1590327 1590328 a   SNP Hom NM_033486
0   chr1    1590327 1590328 a   SNP Hom NM_033487
0   chr1    1590327 1590328 a   SNP Hom NM_033488
0   chr1    1590327 1590328 a   SNP Hom NM_033489
0   chr1    1590327 1590328 a   SNP Hom NM_033492
0   chr1    1590327 1590328 a   SNP Hom NM_033493
1   chr1    1590526 1590527 g   SNP Hom NM_033486
1   chr1    1590526 1590527 g   SNP Hom NM_033487
1   chr1    1590526 1590527 g   SNP Hom NM_033488
1   chr1    1590526 1590527 g   SNP Hom NM_033489
1   chr1    1590526 1590527 g   SNP Hom NM_033492')

Yeah right! It is my usual way of over-complicating things :P. Thanks for the comment / edit guys. @docendodiscimus — LyzandeR, Jul 08 '16 at 14:54
