
I have a large time-series data set with 8 key data variables. For each row, I essentially look up the same combination in the year before and build a cumulative sum.

I have code that works, but it is incredibly slow. Is there a more efficient way to do this?

EDIT: My code wasn't reproducible; I am sorry for that. After more effort than I am proud of, it should be now.

Source data

Data <- data.frame(
  V1 = rep(1:50, 25*3*20),
  V2 = rep(1:25, each= 50, 3*20),
  V3 = rep(1:3, each=50*25, 20),
  Year = rep(1:20, each=50*25*3),
  YearTotal = rnorm(75000), 
  CumulativeSum = 0)

Working but very slow code

for (i in 1:75000){
  Data$CumulativeSum[i] <- Data$YearTotal[i] +
    ifelse(Data$Year[i] == 1, 0,
           Data$CumulativeSum[Data$V1 == Data$V1[i] &
                              Data$V2 == Data$V2[i] &
                              Data$V3 == Data$V3[i] &
                              Data$Year + 1 == Data$Year[i]])
}

My actual data set is quite a bit larger and has character and period variables, and the calculation takes over 30 minutes. I couldn't find any similar problems on StackOverflow; maybe I didn't know what to search for. I can't help but believe there must be a more elegant and faster way to get it done.

Novo88

1 Answer


You can use dplyr to speed up your code:

library(dplyr)

Data <- Data %>% 
  group_by(V1, V2, V3) %>% 
  arrange(Year) %>% 
  mutate(CumulativeSum = cumsum(YearTotal))

By grouping by V1, V2, and V3 we ensure that the cumulative sum runs over each combination separately. Then we arrange by Year and take the cumulative sum.
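As a quick sanity check, a sketch on a small toy data set (the sizes here are made up, not the question's) showing that the grouped cumulative sum reproduces the row-by-row loop exactly:

```r
library(dplyr)

# Toy version of the question's data: 2 x 2 groups over 3 years
set.seed(1)
toy <- data.frame(
  V1 = rep(1:2, 2 * 3),
  V2 = rep(1:2, each = 2, times = 3),
  Year = rep(1:3, each = 4),
  YearTotal = rnorm(12),
  CumulativeSum = 0
)

# The question's loop: add the previous year's total for the same group
for (i in seq_len(nrow(toy))) {
  toy$CumulativeSum[i] <- toy$YearTotal[i] +
    ifelse(toy$Year[i] == 1, 0,
           toy$CumulativeSum[toy$V1 == toy$V1[i] &
                             toy$V2 == toy$V2[i] &
                             toy$Year + 1 == toy$Year[i]])
}

# Vectorised version, then restored to the original row order
check <- toy %>%
  group_by(V1, V2) %>%
  arrange(Year, .by_group = TRUE) %>%
  mutate(GroupedSum = cumsum(YearTotal)) %>%
  ungroup() %>%
  arrange(Year, V2, V1)

all.equal(toy$CumulativeSum, check$GroupedSum)  # TRUE
```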

Or with data.table:

library(data.table)

setDT(Data)
Data[order(Year), CumulativeSum := cumsum(YearTotal), by = .(V1, V2, V3)]
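If you'd rather avoid extra packages, base R's `ave()` can do the same grouped cumulative sum. A sketch on a smaller, made-up version of the example data:

```r
# Small, hypothetical version of the question's data for illustration
set.seed(1)
Data <- data.frame(
  V1 = rep(1:5, 3 * 2 * 4),
  V2 = rep(1:3, each = 5, times = 2 * 4),
  V3 = rep(1:2, each = 5 * 3, times = 4),
  Year = rep(1:4, each = 5 * 3 * 2),
  YearTotal = rnorm(120)
)

# Make sure years are consecutive within each group, then
# ave() applies cumsum within each V1/V2/V3 combination
Data <- Data[order(Data$V1, Data$V2, Data$V3, Data$Year), ]
Data$CumulativeSum <- ave(Data$YearTotal, Data$V1, Data$V2, Data$V3,
                          FUN = cumsum)
```

This is typically slower than data.table on large data, but it keeps the dependency footprint at zero.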
kath