Starting cumsum() at a fixed day of each year

Question

I want to accumulate a variable that has been measured every day over a long period spanning several years. The accumulation should start at a fixed day of the year (for example, the 1st of February, i.e. day of year (DOY) 32; I usually work in DOYs). Each year the cumsum should restart on this fixed day.

I tried setDT(df)[, whatiwant := cumsum(variable), by = rleid(DOY >= 32)] and rle(DOY >= 32), but neither of them handles the first days of each year.

In theory, the function ave() should work fine but I do not know how to create a flag variable between doys of different years (it usually creates only the first one).

df <- data.frame(Date = seq(as.Date("2010-01-01"), by = 1, len = 1000),
                 Year = format(seq(as.Date("2010-01-01"), by = 1, len = 1000), "%Y"),
                 DOY = format(seq(as.Date("2010-01-01"), by = 1, len = 1000), "%j"),
                 Variable = rnorm(1000, mean=10, sd=3))

EDIT: Thanks for your help. How does it work with the data.table package?

scaumedes, welcome to SO! We can't answer you concretely because we don't know your data nor what code you've tried. Please make this question *reproducible*. This includes sample code (including listing non-base R packages), sample *unambiguous* data (e.g., `dput(head(x))` or `data.frame(x=...,y=...)`), and expected output. Refs: https://stackoverflow.com/questions/5963269, https://stackoverflow.com/help/mcve, and https://stackoverflow.com/tags/r/info. — r2evans, Oct 21 '19 at 20:34

Michael Tuchman · Answer 1 · 2019-11-13T16:22:03.927

Let's start with a functional interface, which is a list of functions, along with their inputs and outputs, that will solve the problem.

The function reset_cum_sums takes two arguments: a vector and a vector of reset positions within it. The output is a vector containing the cumulative sums, with the sum restarting at each given position. An example should make this clearer:

At each reset position, the cumulative sum resets. So, if the inputs are 1:10 and the position vector is 3 5 7, the output would be

input: [1 2 3 4 5 6 7 8 9 10]

output: [1 3 3 7 5 11 7 15 24 34]

If no positions are given, this will produce the same result as cumsum.
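The question mentions ave(), so here is a minimal base-R sketch of the same behaviour built on a flag variable. The name reset_by_flag is hypothetical and this is not the implementation given further down; it only reproduces the 1:10 example above.

```r
# Hypothetical helper: restart the cumulative sum at each index in `positions`.
reset_by_flag <- function(x, positions = NULL) {
  # TRUE at every index where a new chunk starts
  starts <- seq_along(x) %in% positions
  # running count of starts gives one group id per chunk
  group <- cumsum(starts)
  # cumulative sum within each group, returned in the original order
  ave(x, group, FUN = cumsum)
}

out <- reset_by_flag(1:10, c(3, 5, 7))
out
# 1 3 3 7 5 11 7 15 24 34
```

With positions left as NULL every element falls in the same group, so the result equals cumsum(x).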

is_feb_first will return TRUE if a date is February 1st, FALSE otherwise. I will leave this as an exercise for you.
The functional interface uses the primitive functions which, split, and lapply, whose documentation is left as an exercise to read.
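For readers who want a starting point, one possible way to write is_feb_first is sketched here. It is an assumption, not the answer's own code, and it expects a vector of class Date.

```r
# Sketch of is_feb_first; assumes `dates` is of class Date.
is_feb_first <- function(dates) {
  # compare month and day only, so the year is ignored
  format(dates, "%m-%d") == "02-01"
}

dates <- as.Date(c("2010-01-31", "2010-02-01", "2012-02-01", "2012-03-01"))
is_feb_first(dates)
# FALSE  TRUE  TRUE FALSE
which(is_feb_first(dates))
# 2 3
```

Wrapping the result in which() turns the logical flags into the position vector that the reset function expects.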

Now, the outline of the solution can be written as:

restart_feb_first <- function(df) {
   # df is assumed to have the columns `value` and `date`
   reset_cum_sums(df$value,
       which(is_feb_first(df$date)))
}

If the February firsts occur in your data at positions 32,32+365,32+730,.. that would make up your position vector. The nice thing is you can easily accommodate leap years.
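To illustrate the leap-year point, the sketch below derives the positions from the dates themselves instead of adding 365 repeatedly. It uses a date range like the one in the question, extended to 1200 days (an assumption made here so that a leap year is crossed).

```r
# Daily dates starting 2010-01-01, long enough to include February 2013
dates <- seq(as.Date("2010-01-01"), by = 1, length.out = 1200)

# "%j" is the zero-padded day of year, so February 1st is always "032"
positions <- which(format(dates, "%j") == "032")
positions
# 32 397 762 1128

# the gap grows to 366 after the leap day in 2012
diff(positions)
# 365 365 366
```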

The only challenging part is writing reset_cum_sums. Here I provide one way to do it (named reset_sum in the code below). The program splits the vector into chunks, each one starting at a reset position (in your case, the February firsts). Note that the pipe operator is not required for this example; you could use traditional function-call notation instead.

I wrote the function this way to illustrate some R concepts, not to get the highest-performing code. If you want to rewrite it for speed, this function is the only part you need to change.

#
# purpose: define a function that computes cumulative sums
# of a vector, restarting at each position given by
# the vector `positions`, which can be NULL.
#

# parameters for hypothetical example
set.seed(18)
values=runif(50)

# cumulative sums reset at these positions.
positions=c(3,13,23,33,43)

# dependencies
require(magrittr) # or tidyverse for pipe operator


reset_sum = function(vector, positions) {
  k = length(vector)
  # cut the vector into pieces, each one starting at a reset position
  splitter = cut(seq_len(k), breaks = c(-Inf, positions, Inf), right = FALSE)
  pieces = split(vector, splitter)
  # do the cumsum of each piece, and then glue them back together
  pieces %>% lapply(cumsum) %>% unlist(use.names = FALSE)
}

Here's how the function would be called

# examples
reset_sum(values,positions)
reset_sum(rep(1,50),positions)

I hope this guides you to a solution that fits your needs. The key concept is to break the problem down until you find a function that is 'easy' to write in terms of R primitives. If you need reset_cum_sums to be super efficient, it should be fairly easy to write in C or with data.table, but let's leave that for another day.

Update

This function returns a vector, so to use it with the data.table package, just assign the result to a new column with :=, as in

DT[, new_column := reset_sum(value, which(is_feb_first(date)))]
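A self-contained sketch of that call, on data shaped like the question's, might look as follows. The column names date and value, the helper is_feb_first, and the pipe-free rewrite of reset_sum are assumptions made for illustration; the grouped variant at the end is an alternative that needs no helper for the sums.

```r
library(data.table)

# Hypothetical helper: TRUE where the date is February 1st.
is_feb_first <- function(dates) format(dates, "%m-%d") == "02-01"

# Pipe-free version of reset_sum, using the same splitting idea.
reset_sum <- function(vector, positions) {
  splitter <- cut(seq_along(vector),
                  breaks = c(-Inf, positions, Inf), right = FALSE)
  unlist(lapply(split(vector, splitter), cumsum), use.names = FALSE)
}

set.seed(42)
DT <- data.table(date = seq(as.Date("2010-01-01"), by = 1, length.out = 1000))
DT[, value := rnorm(.N, mean = 10, sd = 3)]

# Assign the vector returned by reset_sum to a new column
DT[, new_column := reset_sum(value, which(is_feb_first(date)))]

# Alternative: build a group id that increases on each February 1st,
# then take the cumulative sum within each group
DT[, season := cumsum(is_feb_first(date))]
DT[, alt := cumsum(value), by = season]
```

Both columns should agree. Days before the first February 1st form their own group and are accumulated from the start of the data.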
