Collapse all columns by an ID column

Question

I'm trying to do something similar to what's answered here, which gets me 80% of the way. I have a data frame with one ID column and multiple information columns. I'd like to roll up all of the other columns so that there's only one row for each ID, and multiple entries are separated by, for instance, a semicolon. Here's an example of what I have and what I want.

HAVE:

     ID  info1          info2
1 id101    one          first
2 id102   twoA second alias A
3 id102   twoB second alias B
4 id103 threeA  third alias A
5 id103 threeB  third alias B
6 id104   four         fourth
7 id105   five          fifth

WANT:

     ID          info1                          info2
1 id101            one                          first
2 id102     twoA; twoB second alias A; second alias B
3 id103 threeA; threeB   third alias A; third alias B
4 id104           four                         fourth
5 id105           five                          fifth

Here's the code used to generate those:

have <- data.frame(ID=paste0("id", c(101, 102, 102, 103, 103, 104, 105)),
                   info1=c("one", "twoA", "twoB", "threeA", "threeB", "four", "five"), 
                   info2=c("first", "second alias A", "second alias B", "third alias A", "third alias B", "fourth", "fifth"),
                   stringsAsFactors=FALSE)
want <- data_frame(ID=paste0("id", c(101:105)),
                   info1=c("one", "twoA; twoB", "threeA; threeB", "four", "five"), 
                   info2=c("first", "second alias A; second alias B", "third alias A; third alias B", "fourth", "fifth"),
                   stringsAsFactors=FALSE)

This question asked basically the same question, but only a single "info" column. I have multiple other columns and would like to do this for all of them.

Bonus points for doing this using dplyr.

talat · Accepted Answer · 2014-11-18T17:22:08.657

19

Here's an option using summarise_each (which makes it easy to apply the changes to all columns except the grouping variables) and toString:

require(dplyr)

have %>%
  group_by(ID) %>%
  summarise_each(funs(toString))

#Source: local data frame [5 x 3]
#
#     ID          info1                          info2
#1 id101            one                          first
#2 id102     twoA, twoB second alias A, second alias B
#3 id103 threeA, threeB   third alias A, third alias B
#4 id104           four                         fourth
#5 id105           five                          fifth

Or, if you want it separated by semicolons, you can use:

have %>%
  group_by(ID) %>%
  summarise_each(funs(paste(., collapse = "; ")))

edited Nov 18 '14 at 17:22

answered Nov 17 '14 at 20:35

talat

68,970
21
126
157

Dammit. Haha you posted this while I was editing my deleted answer. – Rich Scriven Nov 17 '14 at 20:38
3

That's SO life ;-) @RichardScriven – talat Nov 17 '14 at 20:39
It's rough in this hood. This beats mine anyway since I didn't know about `paste(., collapse = "; ")` I just wrote a custom `paste`. Still learning `dplyr` – Rich Scriven Nov 17 '14 at 20:41
Thanks. Mick's solution below also works well. This solution takes about 12 seconds, the solution below using base takes about 14 seconds, for collapsing 43,602 rows to 41,128. – Stephen Turner Nov 17 '14 at 20:50
Welcome, @StephenTurner - seems both solutions are pretty close to each other then. – talat Nov 17 '14 at 20:53
how many seconds does it take you to load dplyr – rawr Nov 17 '14 at 20:59
It's always loaded :) -- @rawr – talat Nov 17 '14 at 21:00
toString did not come to my mind. My breakfast note is this one. Nice! +1 :-) – jazzurro Nov 17 '14 at 23:28

score 12 · Answer 2 · edited Nov 17 '14 at 20:44

12

Good old aggregate does this just fine

aggregate(have[,2:3], by=list(have$ID), paste, collapse=";")

Question is: does it scale?

edited Nov 17 '14 at 20:44

talat

68,970
21
126
157

answered Nov 17 '14 at 20:43

BioMickWatson

121
2

+1 for the base R solution – talat Nov 17 '14 at 20:44
cleaner than mine: `do.call(cbind, lapply(names(have[-1]), function(ii) by(have, have$ID, FUN = function(x) paste(x[[ii]], collapse = '; '))))` – rawr Nov 17 '14 at 20:48
5

The formula method is a bit cleaner `aggregate(. ~ ID, have, paste, collapse = "; ")` and might be faster – Rich Scriven Nov 17 '14 at 20:53

Rich Scriven · Answer 3 · 2014-11-18T04:22:57.963

10

Here's a data.table solution.

library(data.table)
setDT(have)[, lapply(.SD, paste, collapse = "; "), by = ID]
#       ID          info1                          info2
# 1: id101            one                          first
# 2: id102     twoA; twoB second alias A; second alias B
# 3: id103 threeA; threeB   third alias A; third alias B
# 4: id104           four                         fourth
# 5: id105           five                          fifth

edited Nov 18 '14 at 04:22

answered Nov 17 '14 at 20:31

Rich Scriven

97,041
11
181
245

Is this your first data.table answer? +1 :-) – jazzurro Nov 17 '14 at 23:26
@jazzurro - Thanks. I think it's my second or third. I'm learning it all!! – Rich Scriven Nov 17 '14 at 23:28
@RichardScriven I see. I am learning data.table too. I wrote two answers with data.table so far. I can often read codes, but writing is still a bit different story. I gotta catch up! – jazzurro Nov 17 '14 at 23:31
@jazzurro I agree. `data.table` is really nice, but also much different than what I'm accustomed to. I'm going through the intro pdf right now. Hopefully I'll add a few more DT answers soon :-) – Rich Scriven Nov 17 '14 at 23:37
@RichardScriven Cool. I have the CRAN manual. I also try to take one example from here everyday and take notes. Seeing David's and akrun's works is helping me quite a bit. – jazzurro Nov 17 '14 at 23:43

zx8754 · Answer 4 · 2014-11-18T08:14:13.663

4

Here is SQL solution^1:

library(sqldf)
#Static solution
sqldf("
SELECT ID,
       GROUP_CONCAT(info1,';') as info1,
       GROUP_CONCAT(info2,';') as info2
FROM have
GROUP BY ID")

#Dynamic solution
concat_cols <- colnames(have)[2:ncol(have)]
group_concat <-
  paste(paste0("GROUP_CONCAT(",concat_cols,",';') as ", concat_cols),
        collapse = ",")
sqldf(
  paste("
      SELECT ID,",
      group_concat,"
      FROM have
      GROUP BY ID"))

# Same output for both static and dynamic solutions
#      ID         info1                         info2
# 1 id101           one                         first
# 2 id102     twoA;twoB second alias A;second alias B
# 3 id103 threeA;threeB   third alias A;third alias B
# 4 id104          four                        fourth
# 5 id105          five                         fifth

^1 - probably data.table solution would perform better with millions of rows, luckily we don't have that many genes yet :)

edited Nov 18 '14 at 08:14

answered Nov 18 '14 at 05:07

zx8754

52,746
12
114
209

Is there a way to do this for all columns witzig explicitly writing the code for each (as indicated in the question there are many columns to be collapsed)? – talat Nov 18 '14 at 06:11
@beginneR updated the post with dynamic solution, this is where the code gets ugly... – zx8754 Nov 18 '14 at 08:16
1

The dynamic version could also be written using `fn$` like this: `nms <- names(have)[-1]; grp <- toString(sprintf("group_concat(%s, ';') as %s", nms, nms)); fn$sqldf( "SELECT ID, $grp FROM have GROUP BY ID")` – G. Grothendieck Nov 18 '14 at 14:29
@G.Grothendieck nice one, never heard of `fn$`. How do one come across such hidden gems? – zx8754 Nov 18 '14 at 17:55

Stelios Sfakianakis · Answer 5 · 2014-11-17T21:54:15.243

library(stringr)
library(dplyr)
have %>% tbl_df %>% group_by(ID) %>% summarise_each(funs(str_c(., collapse="; ")))

Edit 1: So tbl_df may not needed and instead of the str_c of the stringr package you could use paste (in base). And what the above does is to group by the ID column and then apply the str_c (or paste) function to each remaining column for each group.

Edit 2: Another solution using the data.table package:

library(data.table)
dtbl <- as.data.table(have)
dtbl[,lapply(.SD, function(x) paste(x,collapse=";")), by=ID]

The above may be faster, especially if you set the key:

setkey(dtbl, ID)

"Hybrid" solution: You can use the dplyr syntax for data.tables! For example:

dtbl %>% tbl_dt %>%
     group_by(ID) %>% 
     summarise_each(funs(paste(., collapse="; ")))

Some explanation would be useful. – Gábor Bakos Nov 17 '14 at 20:58 — Gábor Bakos, Nov 17 '14 at 20:58

Collapse all columns by an ID column

5 Answers5

Linked

Related