I am looking for the equivalent of a cumulative sum in R, but for character strings instead of numbers: each element should be concatenated onto all the elements before it.

E.g. in the data frame "df":

Column A contains the input, column B the desired result.

  A        B
1 banana   banana 
2 boats    banana boats
3 are      banana boats are
4 awesome  banana boats are awesome
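For anyone who wants to reproduce this, the input could be built roughly as follows. This is only a sketch: it assumes column A is stored as character rather than factor, and the name `expected` is introduced here purely for comparison.

```r
# Assumption: A is a character column (stringsAsFactors = FALSE).
df <- data.frame(
  A = c("banana", "boats", "are", "awesome"),
  stringsAsFactors = FALSE
)

# The desired column B, written out by hand so that any solution
# can be checked against it with identical().
expected <- c(
  "banana",
  "banana boats",
  "banana boats are",
  "banana boats are awesome"
)
```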

Currently I am solving this with the following loop:

df$B <- ""

for(i in 1:nrow(df)) {
    if (length(df[i-1,"A"]) > 0) {
        df$B[i] <- paste(df$B[i-1],df$A[i])
    } else {
        df$B[i] <- df$A[i]
    }
}

I wonder whether there exists a more elegant/faster solution.

Phil
  • It is not at all "cumsum"! –  Feb 12 '16 at 12:25
  • Is performance an issue? – Heroka Feb 12 '16 at 12:32
  • I _think_ the classic `cumpaste` appeared [**here**](http://stackoverflow.com/questions/24862046/cumulative-pasting-concatenating-values-grouped-by-another-variable-in-r/24864007#24864007) first (possible duplicate). Kudos to @alexis_laz. – Henrik Feb 12 '16 at 12:46
  • [Another similar Q&A](http://stackoverflow.com/questions/34778422/progressive-concatenation-of-a-column-by-a-group?lq=1), albeit also 'by group' like the answer above. But the 'by group' is rarely the tricky part... – Henrik Feb 12 '16 at 12:54
  • Thanks for all the answers! Found Reduce to be the fastest so marked that as top answer. Sorry in case this was a duplicate! It appears I searched for the wrong terms. – Phil Feb 12 '16 at 13:19

3 Answers

(df$B <- Reduce(paste, as.character(df$A), accumulate = TRUE))
# [1] "banana"     "banana boats"      "banana boats are"    "banana boats are awesome"
Julius Vainora
  • Impressive, and blazingly fast. (on an input vector of 1000 strings, 20x faster than my solution) – Heroka Feb 12 '16 at 12:31
  • @Heroka Reduce is just a `for` loop. – Roland Feb 12 '16 at 12:39
  • @Roland and so is sapply, but on my machine `Reduce` blew the other answers out of the park. I think it's the `accumulate = TRUE`. – Heroka Feb 12 '16 at 12:41
  • @Heroka Well, yes. Obviously it handles the accumulation better than your approach, but it's just nice syntactic sugar. If you look at the internal code you see a standard `for` loop. – Roland Feb 12 '16 at 12:44
  • TIL, thanks. Still reduce is faster, or I wrote an inefficient for-loop. – Heroka Feb 12 '16 at 12:46
  • @Roland it's not "just" a for loop. There's quite a lot more going on that explains the increase in speed. For a start, you have forced calls (see `?forceAndCall`). And more importantly, the function `Reduce` is compiled to bytecode already. Any compiled code will outperform a "hand made" for-loop. So calling it syntactic sugar is doing injustice to the function. – Joris Meys Feb 12 '16 at 12:51
  • @JorisMeys I have no issue with `Reduce`. I use it myself. But it is "just a `for` loop", though a well written one. You can byte-compile a `for` loop yourself and better performance is likely. – Roland Feb 12 '16 at 12:58
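To make the discussion above concrete, here is a rough sketch of an explicit loop that produces the same result as `Reduce(paste, x, accumulate = TRUE)`. It is not the actual source of `Reduce`; `cumpaste_loop` is a hypothetical helper name chosen for illustration, and the `sep` argument is an addition not present in the answer.

```r
# Hypothetical helper: an explicit accumulate loop over a character vector.
cumpaste_loop <- function(x, sep = " ") {
  x <- as.character(x)
  out <- character(length(x))        # preallocate the result
  if (length(x) == 0L) return(out)   # nothing to accumulate
  out[1] <- x[1]
  # seq_along(x)[-1] is empty for a single element, so the loop is skipped
  for (i in seq_along(x)[-1]) {
    out[i] <- paste(out[i - 1], x[i], sep = sep)
  }
  out
}

words <- c("banana", "boats", "are", "awesome")

# Both approaches should agree on this input.
same <- identical(
  cumpaste_loop(words),
  Reduce(paste, words, accumulate = TRUE)
)
same
```

Such a function can be byte-compiled explicitly with `compiler::cmpfun(cumpaste_loop)`, which is one way to test the byte-compilation point made in the comments. Timings will depend on the R version and input size, so none are claimed here.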

I don't know if it's faster, but at least the code is shorter:

sapply(seq_along(df$A),function(x){paste(df$A[1:x], collapse=" ")})

Thanks to Roland's comment, I realised that this is one of the rare occurrences where a for-loop is useful, as it saves us the repeated indexing. It differs from the OP's loop in that it starts at 2, which removes the need for the if statement inside the for-loop.

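res <- character(length(df1$A))   # preallocate a character vector of the right length
res[1] <- as.character(df1$A[1])
for(i in seq_along(df1$A)[-1]){   # starts at 2; skipped entirely if there is only one row
   res[i] <- paste(res[i-1],df1$A[i])
 }
res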
Heroka

We can try

 i1 <- sequence(seq_len(nrow(df1)))
 tapply(df1$A[i1], cumsum(c(TRUE,diff(i1) <=0)),
                     FUN= paste, collapse=' ')

Or

 i1 <- rep(seq(nrow(df1)), seq(nrow(df1)))
 tapply(i1, i1, FUN= function(x) 
          paste(df1$A[seq_along(x)], collapse=' ') )
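To see why the first variant works, it may help to inspect the intermediate values. The sketch below assumes `df1` holds the four example words as a character column; the names `grp` and `res` are introduced here for illustration.

```r
# Assumed input: the example data as a character column.
df1 <- data.frame(
  A = c("banana", "boats", "are", "awesome"),
  stringsAsFactors = FALSE
)

# Indices of every prefix (1, 1:2, 1:3, 1:4) laid end to end.
i1 <- sequence(seq_len(nrow(df1)))
i1
# 1 1 2 1 2 3 1 2 3 4

# A new group starts wherever the index does not increase,
# i.e. wherever a new prefix begins again at 1.
grp <- cumsum(c(TRUE, diff(i1) <= 0))
grp
# 1 2 2 3 3 3 4 4 4 4

# Paste the words of each prefix together, one result per group.
res <- tapply(df1$A[i1], grp, FUN = paste, collapse = " ")
as.vector(res)
```

Note that `i1` has n(n+1)/2 elements for n rows, so this approach builds every prefix in full before pasting.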
akrun