R -- How to conditionally sum binary time series data for large data frames

Question

I've been trying to solve this issue for too long now. I have binary insect outbreak data in annual time series format for 300+ years (rows) and 70+ trees (columns).

I'd like to conditionally fill a dataframe / matrix / data table of the same dimensions with cumulative sums, and have it reset to 0 at the end of each outbreak period. I've found a wealth of similar questions / answers that I just can't seem to translate to my issue.

I'll have a snippet of a dataframe, e.g., that looks like this:

      t1 t2 t3 t4 t5
2000   1  0  0  1  0
2001   1  0  0  0  1
2002   1  1  0  0  1
2003   0  1  0  1  1
2004   1  1  1  1  1

And I want to create a new df that looks like this:

      t1 t2 t3 t4 t5
2000   1  0  0  1  0
2001   2  0  0  0  1
2002   3  1  0  0  2
2003   0  2  0  1  3
2004   1  3  1  2  4
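
For reproducibility, the two snippets above could be built like this. This is only a sketch: the object names df1 and expected are placeholders, and the values are typed in by hand from the tables shown.

```r
# Input: binary outbreak data, one row per year, one column per tree.
df1 <- data.frame(
  t1 = c(1, 1, 1, 0, 1),
  t2 = c(0, 0, 1, 1, 1),
  t3 = c(0, 0, 0, 0, 1),
  t4 = c(1, 0, 0, 1, 1),
  t5 = c(0, 1, 1, 1, 1),
  row.names = 2000:2004
)

# Desired result: a running count within each outbreak, reset to 0 when it ends.
expected <- data.frame(
  t1 = c(1, 2, 3, 0, 1),
  t2 = c(0, 0, 1, 2, 3),
  t3 = c(0, 0, 0, 0, 1),
  t4 = c(1, 0, 0, 1, 2),
  t5 = c(0, 1, 2, 3, 4),
  row.names = 2000:2004
)
```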

I feel I've gotten close with both the data.table package and base R's rle function, although I've also been going in circles (I'm pretty sure I did it for a single column once, but now can't remember what I did, or why I couldn't get it to work in a loop over all columns...).

I've gotten the following methods to work to some extent, but usually only for a single column, or by adding one df on top of a shifted df, so a single column might look like 0 1 2 2 1 0 instead of 0 1 2 3 4 0. Some of my attempts, in case it helps, have been variations on code like this:

setDT(dt)[, new := t1 + shift(t1, fill = 0)]


apply(rle(matrix)$lengths, 2, seq)
rle(matrix[, 1])$lengths


for( i in 1:dim(dt)[1]) {
  for( j in 1:dim(dt)[2]) {
    cols <- names(dt) # tried in place of .SD with negative results
    if( dt[i,j] == 1) {
      dt[, new := .SD + shift(.SD, 1L, fill = 0, type = "lag", give.names = TRUE)]
    } else { dt }
  }
}

Some of the main SO sources I've used include these pages: data.table, dplyr, rle

Let me know if I'm missing any important info (I'm new!), and thank you so much for any help!

Psidom · Answer 1 · 2016-09-15T22:48:44.263

You can use data.table combined with the ave function to calculate the cumsum of each column grouped by the rleid of the column itself:

library(data.table)
setDT(dt)[, names(dt) := lapply(.SD, function(col) ave(col, rleid(col), FUN = cumsum))][]

#   t1 t2 t3 t4 t5
#1:  1  0  0  1  0
#2:  2  0  0  0  1
#3:  3  1  0  0  2
#4:  0  2  0  1  3
#5:  1  3  1  2  4
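
To see why grouping by run id works, here is a minimal sketch on a single column (t1 from the question). It uses base R only: the run ids are computed by hand as a stand-in for data.table's rleid, and the names x, run_id and res are illustrative rather than part of the code above.

```r
# One column from the example (t1).
x <- c(1, 1, 1, 0, 1)

# Hand-rolled run ids (a stand-in for data.table::rleid(x)):
# the id goes up by one every time the value changes.
run_id <- cumsum(c(TRUE, x[-1] != x[-length(x)]))
run_id
# 1 1 1 2 3

# Cumulative sum restarted within each run; a run of 0s sums to 0.
res <- ave(x, run_id, FUN = cumsum)
res
# 1 2 3 0 1
```

Because every run of zeros forms its own group, its cumulative sum stays at 0, which is what produces the reset at the end of each outbreak.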

edited Sep 15 '16 at 22:48

answered Sep 15 '16 at 22:43


score 1 · Accepted Answer · answered Sep 16 '16 at 02:16

We can use rle with sequence, both from base R:

df2 <- df1 #create a copy of df1
#loop through the columns of 'df2', apply the `rle`, get the 'sequence'
#of 'lengths' and multiply with the column values.
df2[] <- lapply(df2, function(x) sequence(rle(x)$lengths)*x)
df2
#     t1 t2 t3 t4 t5    
#2000  1  0  0  1  0
#2001  2  0  0  0  1
#2002  3  1  0  0  2
#2003  0  2  0  1  3
#2004  1  3  1  2  4
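
To make the intermediate steps visible, here is a small sketch on one column (t5 from the question). The variable names x, r, pos and out are just for illustration.

```r
# One column from the example (t5).
x <- c(0, 1, 1, 1, 1)

# rle() splits the vector into runs of identical values.
r <- rle(x)
r$lengths
# 1 4
r$values
# 0 1

# sequence() counts 1..n inside each run, for zero runs and one runs alike.
pos <- sequence(r$lengths)
pos
# 1 1 2 3 4

# Multiplying by the original 0/1 values zeroes out the non-outbreak years.
out <- pos * x
out
# 0 1 2 3 4
```

This relies on the data being strictly 0/1: the multiplication only acts as a mask because the outbreak years are coded as exactly 1.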

answered Sep 16 '16 at 02:16

akrun

This works quickly and easily. Thanks; a week of tearing my hair out and it's this simple! I never realized I could leave df2's brackets completely empty of references to rows / columns. I've also never toyed with function(x) floating within a function before, defined without {}. That's really new / confusing to me. – alaskayo Sep 16 '16 at 19:11
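
On the two points raised in this comment, a small sketch may help. The data frame df and the names doubled, df2, df3, f1 and f2 are made up for illustration.

```r
df <- data.frame(a = c(1, 0, 1), b = c(0, 1, 1))

# lapply() over a data frame returns a plain list, one element per column.
doubled <- lapply(df, function(x) x * 2)
class(doubled)
# "list"

# Empty brackets assign into the existing object, so it stays a data frame
# and keeps its column and row names.
df2 <- df
df2[] <- doubled
class(df2)
# "data.frame"

# Without the brackets the object is simply replaced by the list.
df3 <- df
df3 <- doubled
class(df3)
# "list"

# Braces are optional when a function body is a single expression.
f1 <- function(x) x * 2
f2 <- function(x) { x * 2 }
identical(f1(3), f2(3))
# TRUE
```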
