Cumulatively paste (concatenate) values grouped by another variable

Question

I have a problem dealing with a data frame in R. I would like to paste the contents of cells in different rows together based on the values of the cells in another column. My problem is that I want the output to be progressively (cumulatively) printed. The output vector must be of the same length as the input vector. Here is a sampel table similar to the one I am dealing with:

id <- c("a", "a", "a", "b", "b", "b")
content <- c("A", "B", "A", "B", "C", "B")
(testdf <- data.frame(id, content, stringsAsFactors=FALSE))
#  id content
#1  a       A
#2  a       B
#3  a       A
#4  b       B
#5  b       C
#6  b       B

And this is want I want the result to look like:

result <- c("A", "A B", "A B A", "B", "B C", "B C B") 
result

#[1] "A"     "A B"   "A B A" "B"     "B C"   "B C B"

What I do NOT need something like this:

ddply(testdf, .(id), summarize, content_concatenated = paste(content, collapse = " "))

#  id content_concatenated
#1  a                A B A
#2  b                B C B

You want something like a "cumulative paste". You could use `Reduce`: `ave(as.character(testdf$content), testdf$id, FUN = function(x) Reduce(paste, x, acc = T))` — alexis_laz, Jul 21 '14 at 09:51
@alexis_laz, this is the comment box, not the answer box :-) — A5C1D2H2I1M1N2O1R2T1, Jul 21 '14 at 09:54
@AnandaMahto : I tend to see it as the "it-has-to-be-in-SO-somewhere-but-I'm-too-lazy-to-search" box :P — alexis_laz, Jul 21 '14 at 10:03
@alexis_laz, but people don't read comments for the answers. So unless you're going to do the work of finding a duplicate to mark, it's much more helpful to the community if you took the 10 seconds to post an answer, let the OP accept it, and show clearly that this is a resolved question.... At least that's my perspective. — A5C1D2H2I1M1N2O1R2T1, Jul 21 '14 at 10:40

alexis_laz · Answer 1 · 2016-02-12T15:20:26.767

36

You could define a "cumulative paste" function using Reduce:

cumpaste = function(x, .sep = " ") 
          Reduce(function(x1, x2) paste(x1, x2, sep = .sep), x, accumulate = TRUE)

cumpaste(letters[1:3], "; ")
#[1] "a"       "a; b"    "a; b; c"

Reduce's loop avoids re-concatenating elements from the start as it elongates the previous concatenation by the next element.

Applying it by group:

ave(as.character(testdf$content), testdf$id, FUN = cumpaste)
#[1] "A"     "A B"   "A B A" "B"     "B C"   "B C B"

Another idea, could to concatenate the whole vector at start and, then, progressively substring:

cumpaste2 = function(x, .sep = " ")
{
    concat = paste(x, collapse = .sep)
    substring(concat, 1L, cumsum(c(nchar(x[[1L]]), nchar(x[-1L]) + nchar(.sep))))
}
cumpaste2(letters[1:3], " ;@-")
#[1] "a"           "a ;@-b"      "a ;@-b ;@-c"

This seems to be somewhat faster, too:

set.seed(077)
X = replicate(1e3, paste(sample(letters, sample(0:5, 1), TRUE), collapse = ""))
identical(cumpaste(X, " --- "), cumpaste2(X, " --- "))
#[1] TRUE
microbenchmark::microbenchmark(cumpaste(X, " --- "), cumpaste2(X, " --- "), times = 30)
#Unit: milliseconds
#                  expr      min       lq     mean   median       uq      max neval cld
#  cumpaste(X, " --- ") 21.19967 21.82295 26.47899 24.83196 30.34068 39.86275    30   b
# cumpaste2(X, " --- ") 14.41291 14.92378 16.87865 16.03339 18.56703 23.22958    30  a

...which makes it the cumpaste_faster.

edited Feb 12 '16 at 15:20

answered Jul 21 '14 at 11:29

alexis_laz

12,884
4
27
37

27

***Terrible*** name for a function, but +1 for the answer (and thanks for bearing with my comments) :-) – A5C1D2H2I1M1N2O1R2T1 Jul 21 '14 at 13:33
1

@AnandaMahto : Ha, I knew I had to go with "foo", but sometimes you have to call it as you see it :). (or, in this case, as you use it..) – alexis_laz Jul 21 '14 at 17:44
How should I modify cumpaste2 function if I want to group by two columns? e.g., testdf$id and testdf$id2 ? – jackDanielle Apr 14 '22 at 03:21
@jackDanielle : If I understand correctly, you don't need to modify the function. Just group by both columns; e.g. something like `ave(x, id, id2, FUN = cumpaste)` – alexis_laz Apr 14 '22 at 10:10
@alexis_laz ```ave(x, id, id2, FUN = cumpaste)``` works but when I use compaste2 function it gives me the subscript out of bound error from the function itself. I'm guessing I need to change ```substring(concat, 1L, ... ```) part.. – jackDanielle Apr 14 '22 at 15:19
@jackDanielle : I think I've debugged the issue. It seems that adding `if(!length(x)) return(x)` before all in the function fixes it? – alexis_laz Apr 14 '22 at 19:30

score 5 · Answer 2 · answered Jul 21 '14 at 11:33

5

data.table solution

library(data.table)
setDT(testdf)[, content2 := sapply(seq_len(.N), function(x) paste(content[seq_len(x)], collapse = " ")), by = id]
testdf

##    id content content2
## 1:  a       A        A
## 2:  a       B      A B
## 3:  a       A    A B A
## 4:  b       B        B
## 5:  b       C      B C
## 6:  b       B    B C B

answered Jul 21 '14 at 11:33

David Arenburg

91,361
17
137
196

Best and simpler solution! – Alex Sep 30 '19 at 11:49
This is awesome. If you only want to paste the unique values just add `unique` before content. – desval Jan 09 '21 at 20:44

score 4 · Answer 3 · answered Nov 02 '19 at 10:22

One option using dplyr and purrr could be:

testdf %>%
 group_by(id) %>%
 transmute(content_concatenated = accumulate(content, ~ paste(.x, .y)))

  id    content_concatenated
  <chr> <chr>               
1 a     A                   
2 a     A B                 
3 a     A B A               
4 b     B                   
5 b     B C                 
6 b     B C B

score 2 · Answer 4 · edited Jul 21 '14 at 11:10

2

Here's a ddply method using sapply and subsetting to paste together incrementally:

library(plyr)
ddply(testdf, .(id), mutate, content_concatenated = sapply(seq_along(content), function(x) paste(content[seq(x)], collapse = " ")))
  id content content_concatenated
1  a       A                    A
2  a       B                  A B
3  a       A                A B A
4  b       B                    B
5  b       C                  B C
6  b       B                B C B

edited Jul 21 '14 at 11:10

A5C1D2H2I1M1N2O1R2T1

190,393
28
405
485

answered Jul 21 '14 at 10:50

James

65,548
14
155
193

You probably would want to mention where `ddply` comes from – David Arenburg Jul 21 '14 at 11:10

score 2 · Answer 5 · answered Jul 21 '14 at 17:23

2

You may also try dplyr

 library(dplyr)
 res <- testdf%>%
        mutate(n=row_number()) %>%
        group_by(id) %>%
        mutate(n1=n[1L]) %>%
        rowwise() %>% 
        do(data.frame(cont_concat= paste(content[.$n1:.$n],collapse=" "),stringsAsFactors=F))

 res$cont_concat
 #[1] "A"     "A B"   "A B A" "B"     "B C"   "B C B"

answered Jul 21 '14 at 17:23

akrun

874,273
37
540
662

1

This only seems to work because `content` is manually defined as a vector above. Is there a way to do it all within dplyr without having to refer to objects outside the dataframe? – Erik Shilts Jul 19 '17 at 02:22
@ErikShilts Please post as a new question – akrun Jul 19 '17 at 04:46

score 0 · Answer 6 · answered Dec 10 '19 at 17:45

For cumulative functions I recommend runner package with runner function which can apply any algorithm on cumulative window. It can't compete with @alexis_laz solution in terms of speed, but if one needs window of certain size, lag or windows dependent on date - I would suggest to use runner.


id <- c("a", "a", "a", "b", "b", "b")
content <- c("A", "B", "A", "B", "C", "B")
testdf <- data.frame(id, content, stringsAsFactors=FALSE)

library(runner)
library(dplyr)
testdf %>%
  group_by(id) %>%
  mutate(
    result = runner(x = content, 
                    f = function(x) paste(x, collapse = " "),
                    type = "character")) # specify output type - by default numeric

For more go to documentation and vignettes

Cumulatively paste (concatenate) values grouped by another variable

6 Answers6

Linked

Related