Collapse data.table based on several columns

Question

I have a very large data.table with 4 columns, only one of which is totally unique. Hard to explain but it looks like this:

          ENSEMBL ENTREZID SYMBOL   ALIAS
1 ENSG00000000003     7105 TSPAN6    T245
2 ENSG00000000003     7105 TSPAN6  TM4SF6
3 ENSG00000000003     7105 TSPAN6 TSPAN-6
4 ENSG00000000003     7105 TSPAN6  TSPAN6
5 ENSG00000000005    64102   TNMD  BRICD4
6 ENSG00000000005    64102   TNMD   CHM1L

I'm trying to collapse it based on the first 3 columns, and make the ALIAS column a list of the values that were collapsed.

I know I can do results <- setDT(df)[, list(ALIAS=paste(ALIAS, collapse = '|')) , ENSEMBL], but then I lose the middle two cols.

If SYMBOL and ENTREZID are the same for each ENSEMBL, use that also as the grouping variable i.e. `setDT(df)[ , .(ALIAS = paste(ALIAS, collapse="|")) , c(names(df)[1:3])]` — akrun, Sep 14 '16 at 16:27

score 2 · Accepted Answer · answered Sep 14 '16 at 16:43

2

You are close. Try

DT[, paste0(ALIAS, collapse="|"), by=c("ENSEMBL", "ENTREZID", "SYMBOL")]

With the data you have given, this gave me

           ENSEMBL ENTREZID SYMBOL                         V1
1: ENSG00000000003     7105 TSPAN6 T245|TM4SF6|TSPAN-6|TSPAN6
2: ENSG00000000005    64102   TNMD               BRICD4|CHM1L

answered Sep 14 '16 at 16:43

Sun Bee

1,595
15
22

figured this out seconds after posting – Ethan Sep 15 '16 at 18:46

Eric Watt · Answer 2 · 2016-09-14T16:40:49.177

1

In this case, add all the columns you want to keep to the by section.

results <- setDT(df)[, 
                     list(ALIAS=paste(ALIAS, collapse = '|')), 
                     list(ENSEMBL, ENTREZID, SYMBOL)]

edited Sep 14 '16 at 16:40

answered Sep 14 '16 at 16:31

Eric Watt

3,180
9
21

Collapse data.table based on several columns

2 Answers2