7

Suppose that I have this input :

             ID     date_1      date_2     str
1            1    2010-07-04  2008-01-20   A
2            2    2015-07-01  2011-08-31   C
3            3    2015-03-06  2013-01-18   D
4            4    2013-01-10  2011-08-30   D
5            5    2014-06-04  2011-09-18   B
6            5    2014-06-04  2011-09-18   B
7            6    2012-11-22  2011-09-28   C
8            7    2014-06-17  2013-08-04   A
10           7    2014-06-17  2013-08-04   B
11           7    2014-06-17  2013-08-04   B

I would like to progressively concatenate the values of the str column by the group variable ID, as showed in the following output :

             ID     date_1      date_2     str
1            1    2010-07-04  2008-01-20   A
2            2    2015-07-01  2011-08-31   C
3            3    2015-03-06  2013-01-18   D
4            4    2013-01-10  2011-08-30   D
5            5    2014-06-04  2011-09-18   B
6            5    2014-06-04  2011-09-18   B,B
7            6    2012-11-22  2011-09-28   C
8            7    2014-06-17  2013-08-04   A
10           7    2014-06-17  2013-08-04   A,B
11           7    2014-06-17  2013-08-04   A,B,B

I tried to use the ave() function with this code :

within(table, {
  Emp_list <- ave(str, ID, FUN = function(x) paste(x, collapse = ","))
})

but it gives the following output, which is not exactly what I want :

         ID      date_1     date_2      str
1         1    2010-07-04 2008-01-20     A
2         2    2015-07-01 2011-08-31     C
3         3    2015-03-06 2013-01-18     D
4         4    2013-01-10 2011-08-30     D
5         5    2014-06-04 2011-09-18     B,B
6         5    2014-06-04 2011-09-18     B,B
7         6    2012-11-22 2011-09-28     C
8         7    2014-06-17 2013-08-04     A,B,B
10        7    2014-06-17 2013-08-04     A,B,B
11        7    2014-06-17 2013-08-04     A,B,B

Of course I'd like to avoid loops, as I work on a large database.

Liam S.
  • 101
  • 5

2 Answers2

9

How about ave() with Reduce(). The Reduce() function allows us to accumulate results as they are calculated. So if we run it with paste() we can accumulate the pasted strings.

f <- function(x) {
    Reduce(function(...) paste(..., sep = ", "), x, accumulate = TRUE)
}

df$str <- with(df, ave(as.character(str), ID, FUN = f)

which gives the updated data frame df

   ID     date_1     date_2     str
1   1 2010-07-04 2008-01-20       A
2   2 2015-07-01 2011-08-31       C
3   3 2015-03-06 2013-01-18       D
4   4 2013-01-10 2011-08-30       D
5   5 2014-06-04 2011-09-18       B
6   5 2014-06-04 2011-09-18    B, B
7   6 2012-11-22 2011-09-28       C
8   7 2014-06-17 2013-08-04       A
10  7 2014-06-17 2013-08-04    A, B
11  7 2014-06-17 2013-08-04 A, B, B

Note: function(...) paste(..., sep = ", ") could also be function(x, y) paste(x, y, sep = ", "). (Thanks Pierre Lafortune)

Rich Scriven
  • 97,041
  • 11
  • 181
  • 245
  • 1
    Dammit, beat me to it as I was trying to figure out the `...` in the `paste` function. Nicely done. – thelatemail Jan 13 '16 at 23:00
  • Haha, you beat me as well with the `Reduce` part. ;) – cryo111 Jan 13 '16 at 23:02
  • @RichardScriven - `Reduce` needs a binary (or binary compatible) function like `+` etc to work properly. Using `...` allows this to happen with `paste`, otherwise it screws up using just `function(x) paste(x,sep=",")` as two arguments are expected by `Reduce`. – thelatemail Jan 13 '16 at 23:05
  • 2
    @RichardScriven - I know you have - I was just explaining that `function(...)` is necessary instead of the possibly more obvious `function(x)` - you said you were figuring out how to explain it so I thought I'd chime in. – thelatemail Jan 13 '16 at 23:09
  • 1
    Nicely done ! that's exactly what I wanted :) Thanks @RichardScriven – Liam S. Jan 13 '16 at 23:15
  • 1
    You don't need the dots for it to work, `function(x,y) paste(x,y, sep=", ")` should work just fine – Pierre L Jan 14 '16 at 00:51
  • @PierreLafortune - true, you could make it a binary function explicitly too. – thelatemail Jan 14 '16 at 00:56
8

Here's a possible solution combining data.table with an inner tapply that seem to get you what you need (you can use paste instead of toString if you like, it just looks cleaner to me that way).

library(data.table)
setDT(df)[, Str := tapply(str[sequence(1:.N)], rep(1:.N, 1:.N), toString), by = ID]
df
#     ID     date_1     date_2 str     Str
#  1:  1 2010-07-04 2008-01-20   A       A
#  2:  2 2015-07-01 2011-08-31   C       C
#  3:  3 2015-03-06 2013-01-18   D       D
#  4:  4 2013-01-10 2011-08-30   D       D
#  5:  5 2014-06-04 2011-09-18   B       B
#  6:  5 2014-06-04 2011-09-18   B    B, B
#  7:  6 2012-11-22 2011-09-28   C       C
#  8:  7 2014-06-17 2013-08-04   A       A
#  9:  7 2014-06-17 2013-08-04   B    A, B
# 10:  7 2014-06-17 2013-08-04   B A, B, B

You may be able to improve it a bit using

setDT(df)[, Str := {Len <- 1:.N ; tapply(str[sequence(Len)], rep(Len, Len), toString)}, by = ID]
David Arenburg
  • 91,361
  • 17
  • 137
  • 196