15

I have a problem dealing with a data frame in R. I would like to paste the contents of cells in different rows together based on the values of the cells in another column. My problem is that I want the output to be progressively (cumulatively) printed. The output vector must be of the same length as the input vector. Here is a sampel table similar to the one I am dealing with:

id <- c("a", "a", "a", "b", "b", "b")
content <- c("A", "B", "A", "B", "C", "B")
(testdf <- data.frame(id, content, stringsAsFactors=FALSE))
#  id content
#1  a       A
#2  a       B
#3  a       A
#4  b       B
#5  b       C
#6  b       B

And this is want I want the result to look like:

result <- c("A", "A B", "A B A", "B", "B C", "B C B") 
result

#[1] "A"     "A B"   "A B A" "B"     "B C"   "B C B"

What I do NOT need something like this:

ddply(testdf, .(id), summarize, content_concatenated = paste(content, collapse = " "))

#  id content_concatenated
#1  a                A B A
#2  b                B C B
Henrik
  • 65,555
  • 14
  • 143
  • 159
user3860074
  • 151
  • 1
  • 3
  • 4
    You want something like a "cumulative paste". You could use `Reduce`: `ave(as.character(testdf$content), testdf$id, FUN = function(x) Reduce(paste, x, acc = T))` – alexis_laz Jul 21 '14 at 09:51
  • 2
    @alexis_laz, this is the comment box, not the answer box :-) – A5C1D2H2I1M1N2O1R2T1 Jul 21 '14 at 09:54
  • @alexis_laz: great! it worked! thanks a lot! all the best – user3860074 Jul 21 '14 at 09:57
  • 1
    @AnandaMahto : I tend to see it as the "it-has-to-be-in-SO-somewhere-but-I'm-too-lazy-to-search" box :P – alexis_laz Jul 21 '14 at 10:03
  • 4
    @alexis_laz, but people don't read comments for the answers. So unless you're going to do the work of finding a duplicate to mark, it's much more helpful to the community if you took the 10 seconds to post an answer, let the OP accept it, and show clearly that this is a resolved question.... At least that's my perspective. – A5C1D2H2I1M1N2O1R2T1 Jul 21 '14 at 10:40

6 Answers6

36

You could define a "cumulative paste" function using Reduce:

cumpaste = function(x, .sep = " ") 
          Reduce(function(x1, x2) paste(x1, x2, sep = .sep), x, accumulate = TRUE)

cumpaste(letters[1:3], "; ")
#[1] "a"       "a; b"    "a; b; c"

Reduce's loop avoids re-concatenating elements from the start as it elongates the previous concatenation by the next element.

Applying it by group:

ave(as.character(testdf$content), testdf$id, FUN = cumpaste)
#[1] "A"     "A B"   "A B A" "B"     "B C"   "B C B"

Another idea, could to concatenate the whole vector at start and, then, progressively substring:

cumpaste2 = function(x, .sep = " ")
{
    concat = paste(x, collapse = .sep)
    substring(concat, 1L, cumsum(c(nchar(x[[1L]]), nchar(x[-1L]) + nchar(.sep))))
}
cumpaste2(letters[1:3], " ;@-")
#[1] "a"           "a ;@-b"      "a ;@-b ;@-c"

This seems to be somewhat faster, too:

set.seed(077)
X = replicate(1e3, paste(sample(letters, sample(0:5, 1), TRUE), collapse = ""))
identical(cumpaste(X, " --- "), cumpaste2(X, " --- "))
#[1] TRUE
microbenchmark::microbenchmark(cumpaste(X, " --- "), cumpaste2(X, " --- "), times = 30)
#Unit: milliseconds
#                  expr      min       lq     mean   median       uq      max neval cld
#  cumpaste(X, " --- ") 21.19967 21.82295 26.47899 24.83196 30.34068 39.86275    30   b
# cumpaste2(X, " --- ") 14.41291 14.92378 16.87865 16.03339 18.56703 23.22958    30  a

...which makes it the cumpaste_faster.

alexis_laz
  • 12,884
  • 4
  • 27
  • 37
  • 27
    ***Terrible*** name for a function, but +1 for the answer (and thanks for bearing with my comments) :-) – A5C1D2H2I1M1N2O1R2T1 Jul 21 '14 at 13:33
  • 1
    @AnandaMahto : Ha, I knew I had to go with "foo", but sometimes you have to call it as you see it :). (or, in this case, as you use it..) – alexis_laz Jul 21 '14 at 17:44
  • How should I modify cumpaste2 function if I want to group by two columns? e.g., testdf$id and testdf$id2 ? – jackDanielle Apr 14 '22 at 03:21
  • @jackDanielle : If I understand correctly, you don't need to modify the function. Just group by both columns; e.g. something like `ave(x, id, id2, FUN = cumpaste)` – alexis_laz Apr 14 '22 at 10:10
  • @alexis_laz ```ave(x, id, id2, FUN = cumpaste)``` works but when I use compaste2 function it gives me the subscript out of bound error from the function itself. I'm guessing I need to change ```substring(concat, 1L, ... ```) part.. – jackDanielle Apr 14 '22 at 15:19
  • @jackDanielle : I think I've debugged the issue. It seems that adding `if(!length(x)) return(x)` before all in the function fixes it? – alexis_laz Apr 14 '22 at 19:30
5

data.table solution

library(data.table)
setDT(testdf)[, content2 := sapply(seq_len(.N), function(x) paste(content[seq_len(x)], collapse = " ")), by = id]
testdf

##    id content content2
## 1:  a       A        A
## 2:  a       B      A B
## 3:  a       A    A B A
## 4:  b       B        B
## 5:  b       C      B C
## 6:  b       B    B C B
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
4

One option using dplyr and purrr could be:

testdf %>%
 group_by(id) %>%
 transmute(content_concatenated = accumulate(content, ~ paste(.x, .y)))

  id    content_concatenated
  <chr> <chr>               
1 a     A                   
2 a     A B                 
3 a     A B A               
4 b     B                   
5 b     B C                 
6 b     B C B  
tmfmnk
  • 38,881
  • 4
  • 47
  • 67
2

Here's a ddply method using sapply and subsetting to paste together incrementally:

library(plyr)
ddply(testdf, .(id), mutate, content_concatenated = sapply(seq_along(content), function(x) paste(content[seq(x)], collapse = " ")))
  id content content_concatenated
1  a       A                    A
2  a       B                  A B
3  a       A                A B A
4  b       B                    B
5  b       C                  B C
6  b       B                B C B
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
James
  • 65,548
  • 14
  • 155
  • 193
2

You may also try dplyr

 library(dplyr)
 res <- testdf%>%
        mutate(n=row_number()) %>%
        group_by(id) %>%
        mutate(n1=n[1L]) %>%
        rowwise() %>% 
        do(data.frame(cont_concat= paste(content[.$n1:.$n],collapse=" "),stringsAsFactors=F))

 res$cont_concat
 #[1] "A"     "A B"   "A B A" "B"     "B C"   "B C B"
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    This only seems to work because `content` is manually defined as a vector above. Is there a way to do it all within dplyr without having to refer to objects outside the dataframe? – Erik Shilts Jul 19 '17 at 02:22
  • @ErikShilts Please post as a new question – akrun Jul 19 '17 at 04:46
0

For cumulative functions I recommend runner package with runner function which can apply any algorithm on cumulative window. It can't compete with @alexis_laz solution in terms of speed, but if one needs window of certain size, lag or windows dependent on date - I would suggest to use runner.


id <- c("a", "a", "a", "b", "b", "b")
content <- c("A", "B", "A", "B", "C", "B")
testdf <- data.frame(id, content, stringsAsFactors=FALSE)

library(runner)
library(dplyr)
testdf %>%
  group_by(id) %>%
  mutate(
    result = runner(x = content, 
                    f = function(x) paste(x, collapse = " "),
                    type = "character")) # specify output type - by default numeric

For more go to documentation and vignettes

GoGonzo
  • 2,637
  • 1
  • 18
  • 25