Cumulative sum with a threshold window in R data.table

Question

I want to calculate the rolling sum of n rows in my dataset where the window size 'n' depends on the sum itself. For example, I want to slide the window as soon as the rolling sum of time exceeds 5 mins. Basically, I want to calculate how much distance the person traveled in the last 5 mins but the time steps are not equally spaced. Here's a dummy data.table for clarity (the last two columns are required):

I am looking for a data.table solution in R

Input data table:

ID	Distance	Time
1	2	2
1	4	1
1	2	1
1	2	2
1	3	3
1	6	3
1	1	1

Desired Output:

ID	Distance	Time	5.min.rolling.distance	5.min.rolling.time
1	2	2	NA	NA
1	4	1	NA	NA
1	2	1	NA	NA
1	2	2	10	6
1	3	3	5	5
1	6	3	9	6
1	1	1	10	7

@jangorecki Thanks but adaptive only allows to have different window sizes for each row but the window size is a fixed vector. — srishti, Dec 28 '20 at 10:25

Lief Esbenshade · Answer 1 · 2020-12-30T18:53:11.840

Here is a solution that works with double time units as well as a simpler solution that will work with integer time units. I tested the double solution on 10,000 records and on my 2015 laptop it executed instantly. I can't make any guarantees about performance on 40 GB of data.

If you wanted to generalize this code I'd look at the RcppRoll package and learn how to implement c++ code in R.

Solution with `double` time units

I broke this down into two problems. First, figure out the window size by looking back until we get to at least 5 minutes (or run out of data). Second, take the sum of distances and time from the current observation to the look back unit.

Bad loop code in R usually tries to 'grow' a vector, its a huge efficiency gain to pre-allocate the vector length and then change elements in it.

input <- data.frame(
  dist = c(2, 4, 2, 2, 3, 6, 1),
  time = c(2, 1, 1, 2, 3, 3, 1)
)

var_window_cumsum <- function(input, MIN_TIME) {
  
  if(is.null(input$time) | is.null(input$dist)) {
    stop("input must have variables time and dist that record the row's duration and distance traveled.")
  }
  
  n <- nrow(input)
  
  # First, figure out how far we need to look back to, this vector will store
  # the position of the first record that gets our target record up to 5 min or
  # more. If we cant look back to 5 min, we leave it as NA.
  time_indx = rep(NA_integer_, length = n) # always preallocate your vector!
  for(time in (1:n)) {
    prior = time # start at self in case observation is already >= MIN_TIME
    while(sum(input$time[time:prior]) < MIN_TIME & prior > 1) {
      prior = prior - 1
    }
    
    # if we cant look back to our minimum time, leave the indx as NA
    if (sum(input$time[time:prior]) >= MIN_TIME) {
      time_indx[time] = prior 
    }
  }
  
  # Now that we know how far to look back, its easy to find out the total distance
  # and total time.
  dist5 =  rep(NA_integer_, n)
  time5 =  rep(NA_integer_, n)
  for (i in 1:n) {
    dist5[i] <- ifelse(!is.na(time_indx[i]), 
                       sum(input$dist[i:time_indx[i]]),
                       NA)
    time5[i] <- ifelse(!is.na(time_indx[i]), 
                       sum(input$time[i:time_indx[i]]),
                       NA)
  }
  
  cbind(input, 
        window_dist = dist5, 
        window_time = time5, 
        window_start = time_indx)
}

# output looks good 
# Warning: example data does not include exhaustive cases 
# I have not setup thorough testing
var_window_cumsum(input, 5)

# Test on a larger dataset, 10k records
set.seed(1234)
n <- 10000
med_input <- data.frame(
  dist = sample(1:5, n, replace = TRUE),
  time = sample(1:60, n, replace = TRUE) / 10
)

# you should inspect this to make sure there are no errors
med_output <- var_window_cumsum(med_input, 5)

Solution with `integer` time units

If your time unit is in integers and your data isn't too big, it may work to complete your dataset. This is a little bit of a hack, but here I create a continuos timeid variable that goes from the starting time to the maximum time, and create one row for each integer unit of time. From there its easy to calculate a rolling cumulative sum for the last five time units. Finally, we get rid of all the fake rows we added in (you want to make sure to do that because they will have invalid cumulative sum data. Also, important to note that I use roll_sumr and not roll_sum; roll_sumr includes 4 padding NA's on the left side of the output vector for the first 4 units.

library(tidyverse)
library(RcppRoll)

input <- data.frame(
  dist = c(2, 4, 2, 2, 3, 6, 1),
  time = c(2, 1, 1, 2, 3, 3, 1)
)

desired_dist5 <- c(NA, NA, NA, 10, 5, 9, 10)
desired_time5 <- c(NA, NA, NA, 6, 5, 6, 7)

output <- input %>% 
  mutate(timeid = cumsum(time), 
         realrow = TRUE) %>% 
  complete(timeid = 1:max(timeid)) %>% 
  mutate(dist5 = roll_sumr(dist, 5, na.rm = T),
         time5 = roll_sumr(time, 5, na.rm = T)) %>% 
  filter(realrow) %>% 
  select(-c(realrow, timeid))

# Check against example table
output$dist5 == desired_dist5
output$time5 == desired_time5

Thanks for your answer. But dist5 is not available in my data. I need to calculate that as well. Updated my question for clarity — srishti, Dec 28 '20 at 10:23
Sure, that was clear before. My code is not materially changed, but I edited it to make it explicit that your desired output is being used to check the validity of the solution. — Lief Esbenshade, Dec 28 '20 at 19:40
Thanks for updating. This is a neat solution. It works like a charm but the problem is I have 40 GB of data with the time unit as double. So I have time as 1.2, 3.6, etc. and this solution doesn't work. It would be great if you could please help me figure out the solution for non-integer time units. — srishti, Dec 30 '20 at 03:18
Added another solution for double units. No guarantees about performance time with that much data. — Lief Esbenshade, Dec 30 '20 at 18:54
You're right about preallocating a vector @LiefEsbenshade. Looks like you improved on my loop solution with your update so I'll remove my answer. — QAsena, Jan 08 '21 at 23:20

Cumulative sum with a threshold window in R data.table

1 Answers1

Solution with `double` time units

Solution with `integer` time units

Linked

Cumulative sum with a threshold window in R data.table

1 Answers1

Solution with double time units

Solution with integer time units

Linked

Solution with `double` time units

Solution with `integer` time units