4

I have a very long dataframe where 1 column out of nearly 56 has many different values, while the rest of the data change in accordance with the first column ID. Here's an example

ID  chrom   left    right   ref_seq var_type    zygosity    transcript_name
0   chr1    1590327 1590328 a       SNP         Hom         NM_033486
0   chr1    1590327 1590328 a       SNP         Hom         NM_033487
0   chr1    1590327 1590328 a       SNP         Hom         NM_033488
0   chr1    1590327 1590328 a       SNP         Hom         NM_033489
0   chr1    1590327 1590328 a       SNP         Hom         NM_033492
0   chr1    1590327 1590328 a       SNP         Hom         NM_033493
1   chr1    1590526 1590527 g       SNP         Hom         NM_033486
1   chr1    1590526 1590527 g       SNP         Hom         NM_033487
1   chr1    1590526 1590527 g       SNP         Hom         NM_033488
1   chr1    1590526 1590527 g       SNP         Hom         NM_033489
1   chr1    1590526 1590527 g       SNP         Hom         NM_033492

The desired result would be to concatenate any duplicate values into a comma seperated string but maintain the ID only once, like this

ID  chrom   left    right   ref_seq var_type    zygosity    transcript_name
0   chr1    1590327 1590328 a       SNP         Hom         NM_033486NM_033487,NM_033488,NM_033489,NM_033492,NM_033493
1   chr1    1590526 1590527 g       SNP         Hom         NM_033486,NM_033487,NM_033488,NM_033489,NM_033492

I've searched for similar questions and the following solutions haven't worked so far; instead they return me a zero row dataframe.

Community
  • 1
  • 1
civy
  • 393
  • 2
  • 17
  • Why do you get all 0? Can you show your script that doesn't work? – Sotos Jul 08 '16 at 13:54
  • The following code worked for me, assuming your working data frame is the same as provided. `df2 <- aggregate(df[,8], df[,-8], FUN = function(X) paste(unique(X), collapse=", "))` – Dave Gruenewald Jul 08 '16 at 14:11

2 Answers2

8

Another solution using base R

aggregate(data=df,transcript_name~.,FUN=paste,collapse=",")

Thanks to @Sotos & @LyzandeR for collapse

user2100721
  • 3,557
  • 2
  • 20
  • 29
4

One way with data.table:

library(data.table)
#setDT will convert the data.frame into data.table
#.SD gives you access to the groups of data.tables created by the 'by' argument
setDT(df)[, list(transcript_name = paste(transcript_name, collapse = ', ')), 
            by = c('ID', 'chrom', 'left', 'right', 'ref_seq', 'var_type', 'zygosity')]
#   ID chrom    left   right ref_seq var_type zygosity                                                  transcript_name
#1:  0  chr1 1590327 1590328       a      SNP      Hom NM_033486, NM_033487, NM_033488, NM_033489, NM_033492, NM_033493
#2:  1  chr1 1590526 1590527       g      SNP      Hom            NM_033486, NM_033487, NM_033488, NM_033489, NM_033492

Data

df <- read.table(header = TRUE, text = 'ID  chrom   left    right   ref_seq var_type    zygosity    transcript_name
0   chr1    1590327 1590328 a   SNP Hom NM_033486
                 0   chr1    1590327 1590328 a   SNP Hom NM_033487
                 0   chr1    1590327 1590328 a   SNP Hom NM_033488
                 0   chr1    1590327 1590328 a   SNP Hom NM_033489
                 0   chr1    1590327 1590328 a   SNP Hom NM_033492
                 0   chr1    1590327 1590328 a   SNP Hom NM_033493
                 1   chr1    1590526 1590527 g   SNP Hom NM_033486
                 1   chr1    1590526 1590527 g   SNP Hom NM_033487
                 1   chr1    1590526 1590527 g   SNP Hom NM_033488
                 1   chr1    1590526 1590527 g   SNP Hom NM_033489
                 1   chr1    1590526 1590527 g   SNP Hom NM_033492')
Arun
  • 116,683
  • 26
  • 284
  • 387
LyzandeR
  • 37,047
  • 12
  • 77
  • 87
  • Yeah right! It is my usual way of over-complicating things :P. Thanks for the comment / edit guys. @docendodiscimus – LyzandeR Jul 08 '16 at 14:54