2

I have a very large data.table with 4 columns, only one of which is totally unique. Hard to explain but it looks like this:

          ENSEMBL ENTREZID SYMBOL   ALIAS
1 ENSG00000000003     7105 TSPAN6    T245
2 ENSG00000000003     7105 TSPAN6  TM4SF6
3 ENSG00000000003     7105 TSPAN6 TSPAN-6
4 ENSG00000000003     7105 TSPAN6  TSPAN6
5 ENSG00000000005    64102   TNMD  BRICD4
6 ENSG00000000005    64102   TNMD   CHM1L

I'm trying to collapse it based on the first 3 columns, and make the ALIAS column a list of the values that were collapsed.

I know I can do results <- setDT(df)[, list(ALIAS=paste(ALIAS, collapse = '|')) , ENSEMBL], but then I lose the middle two cols.

zx8754
  • 52,746
  • 12
  • 114
  • 209
Ethan
  • 387
  • 1
  • 2
  • 13
  • 1
    If SYMBOL and ENTREZID are the same for each ENSEMBL, use that also as the grouping variable i.e. `setDT(df)[ , .(ALIAS = paste(ALIAS, collapse="|")) , c(names(df)[1:3])]` – akrun Sep 14 '16 at 16:27

2 Answers2

2

You are close. Try

DT[, paste0(ALIAS, collapse="|"), by=c("ENSEMBL", "ENTREZID", "SYMBOL")]

With the data you have given, this gave me

           ENSEMBL ENTREZID SYMBOL                         V1
1: ENSG00000000003     7105 TSPAN6 T245|TM4SF6|TSPAN-6|TSPAN6
2: ENSG00000000005    64102   TNMD               BRICD4|CHM1L
Sun Bee
  • 1,595
  • 15
  • 22
1

In this case, add all the columns you want to keep to the by section.

results <- setDT(df)[, 
                     list(ALIAS=paste(ALIAS, collapse = '|')), 
                     list(ENSEMBL, ENTREZID, SYMBOL)]
Eric Watt
  • 3,180
  • 9
  • 21