1

I am fairly new to R and am trying to find a rolling standard deviation over a period of several months (3, 6, 9) in integer groups. For instance, for a year of data and three groups, I would like to find the standard deviation for each group 1, 2, 3 for (Jan, Feb, Mar), (Feb, Mar, Apr), (Mar, Apr, May), etc.

In my dataframe df, I have column NUM: with the values to find standard deviation from, column NO: integers defining groups, and column date: which has daily dates. I also made column Yr_Mo, which is an integer corresponding to the year and month of a date. So for instance, all January dates in 2017 would have the value 1701 in the column Yr_Mo

For one month, I used aggregate: new <- aggregate( NUM ~ Yr_Mo + NO, df, sd )

This is pretty basic. However, it seems to be more complicated for groups of 3+ months. Because not all months are the same length and some months have missing dates, I cannot hardcode for certain time intervals. I've seen a lot of posts about similar questions to mine but these questions seem to ask in general about finding rolling std devs or else grouping but not both. I was thinking of using zoo functions such as rollapply, but again can't see how to consider both parts of my problem.

Thanks in advance for any help or pointers to documentation I might learn from!

NO date       Yr_Mo  NUM
1  2017-01-01 1701   3.4
1  2017-01-02 1701   5
1  2017-01-12 1701   4.2
1  2017-01-13 1701   1
1  2017-01-20 1701   6
1  2017-02-03 1702   3.9
1  2017-02-08 1702   5.5
1  2017-02-15 1702   8
1  2017-02-22 1702   1.1
1  2017-02-26 1702   4
1  2017-03-02 1703   1
1  2017-03-07 1703   7.5
1  2017-03-11 1703   2
1  2017-03-20 1703   3.1
1  2017-03-28 1703   2
1  2017-04-01 1704   2
1  2017-04-05 1704   3.5
1  2017-04-12 1704   1
1  2017-04-19 1704   4.1
1  2017-04-23 1704   5
1  2017-05-02 1705   1
1  2017-05-03 1705   4.5
1  2017-05-04 1705   2
1  2017-05-10 1705   6.1
1  2017-05-20 1705   7
2  2017-01-01 1701   3
2  2017-01-02 1701   53
2  2017-01-11 1701   2
2  2017-01-15 1701   4.1
2  2017-01-22 1701   1
2  2017-02-01 1702   8.9
2  2017-02-08 1702   1.5
2  2017-02-15 1702   3
2  2017-02-27 1702   7.2
2  2017-02-28 1702   4
2  2017-03-02 1703   1
2  2017-03-07 1703   5.2
2  2017-03-11 1703   2
2  2017-03-21 1703   1
2  2017-03-28 1703   2
2  2017-04-01 1704   2.4
2  2017-04-05 1704   3.5
2  2017-04-11 1704   1
2  2017-04-19 1704   4.1
2  2017-04-23 1704   3
2  2017-05-02 1705   1.2
2  2017-05-03 1705   4.5
2  2017-05-04 1705   2
2  2017-05-10 1705   6.1
2  2017-05-21 1705   9
csny
  • 13
  • 5
  • Welcome to Stack Overflow, in order to find help here, please consider [how to write a reproducible example](https://stackoverflow.com/a/5963610/6574038), thank you. – jay.sf Feb 20 '18 at 21:03
  • Without seeing your data it's hard to say more, but the general idea would be for you to create a "group" variable that splits your data into non-overlapping blocks of 3+ months each. Then the usual `group_by` semantics should be straightforward. – kgolyaev Feb 20 '18 at 21:09
  • Hm, I don't understand how I could use non-overlapping blocks to get rolling std devs? I have added a version of the data I am using in case that makes my question more clear. – csny Feb 20 '18 at 23:42

2 Answers2

2

Using the definition of variance (see sample variance) and what OP has mentioned in the question (i.e. aggregate and rollapply), we can calculate the rolling 3 months standard deviation as follows. More comments inline.

winsize <- 3

#calculate sum of squares of NUM by month and group
sumxsq <- aggregate(NUM ~ Yr_Mo + NO, df, function(x) sum(x^2))
names(sumxsq) <- c("Yr_Mo", "NO", "SUM_X_SQ")

#calculate sum of NUM by month and group
sumx <- aggregate(NUM ~ Yr_Mo + NO, df, sum)
names(sumx) <- c("Yr_Mo", "NO", "SUM_X")

#count number of observations by month and group
nobs <- aggregate(NUM ~ Yr_Mo + NO, df, length)
names(nobs) <- c("Yr_Mo", "NO", "N")

#merge all stats together
mySD <- merge(merge(sumxsq, sumx, by=c("NO","Yr_Mo")), nobs, by=c("NO","Yr_Mo"))

#calculate rolling sample variance using zoo::rollapplyr by group, then take sqrt for sd
mySD$STD_DEV <- sqrt(unlist(by(mySD, mySD$NO, function(submySD) {
    zoo::rollapplyr(submySD, 
        width=winsize, 
        FUN=function(x) (sum(x[,"SUM_X_SQ"]) - sum(x[,"SUM_X"])^2 / sum(x[,"N"])) / (sum(x[,"N"]) - 1), 
        by.column=FALSE,
        fill=NA)
})))
mySD

Solution assumes that there are at least 1 data point in each month for each group. Please let me know if this helps.

data:

df <- read.csv(text="NO,date,Yr_Mo,NUM
1,2017-01-01,1701,3.4
1,2017-01-02,1701,5
1,2017-01-12,1701,4.2
1,2017-01-13,1701,1
1,2017-01-20,1701,6
1,2017-02-03,1702,3.9
1,2017-02-08,1702,5.5
1,2017-02-15,1702,8
1,2017-02-22,1702,1.1
1,2017-02-26,1702,4
1,2017-03-02,1703,1
1,2017-03-07,1703,7.5
1,2017-03-11,1703,2
1,2017-03-20,1703,3.1
1,2017-03-28,1703,2
1,2017-04-01,1704,2
1,2017-04-05,1704,3.5
1,2017-04-12,1704,1
1,2017-04-19,1704,4.1
1,2017-04-23,1704,5
1,2017-05-02,1705,1
1,2017-05-03,1705,4.5
1,2017-05-04,1705,2
1,2017-05-10,1705,6.1
1,2017-05-20,1705,7
2,2017-01-01,1701,3
2,2017-01-02,1701,53
2,2017-01-11,1701,2
2,2017-01-15,1701,4.1
2,2017-01-22,1701,1
2,2017-02-01,1702,8.9
2,2017-02-08,1702,1.5
2,2017-02-15,1702,3
2,2017-02-27,1702,7.2
2,2017-02-28,1702,4
2,2017-03-02,1703,1
2,2017-03-07,1703,5.2
2,2017-03-11,1703,2
2,2017-03-21,1703,1
2,2017-03-28,1703,2
2,2017-04-01,1704,2.4
2,2017-04-05,1704,3.5
2,2017-04-11,1704,1
2,2017-04-19,1704,4.1
2,2017-04-23,1704,3
2,2017-05-02,1705,1.2
2,2017-05-03,1705,4.5
2,2017-05-04,1705,2
2,2017-05-10,1705,6.1
2,2017-05-21,1705,9", header=TRUE)
chinsoon12
  • 25,005
  • 4
  • 25
  • 35
1

You could make a function to split your data, use your Yr_Mo column to create upper and lower boundaries to subset, and then just grab the sd() value for the subset range. Where df is the dataset that you provided above, first rearrange the dataset (not needed, but makes it easier to check that output is correct)

Sorry, completely missed that you were wanting to keep the NO grouping as well. This should do the trick (df here is the example data you provided above):

This function iterates over each unique Yr_Mo value to generate upper and lower bounds for the range (in this case, x - 1 : x + 1). It then subsets the provided dataframe based on these bounds and calculates the sd for NUM. If the subset is not valid (there are fewer than three months available for the time frame) the output is NA.

roll_sd <- function(df_, lead = 1, lag = -1) {
  id_sd <- do.call(rbind, lapply(unique(df_$Yr_Mo), function(x) {
    start = x + lag
    end = x + lead
    group = df_[df_$Yr_Mo >= start & df_$Yr_Mo <= end,]
    group_sd = sd(group$NUM)
    group_sd = ifelse(length(unique(group$Yr_Mo)) < 3, NA, sd(group$NUM))
    out = data.frame(central_value = x, group_sd)
  })
  )
}

Then, use group_by to apply this function to each grouping of NO:

library(dplyr)

df2 <- df %>% 
  group_by(NO) %>%
  do(roll_sd(data.frame(.)))

> as.data.frame(df2)
   NO central_value  group_sd
1   1          1701        NA
2   1          1702  2.248449
3   1          1703  2.209460
4   1          1704  2.179406
5   1          1705        NA
6   2          1701        NA
7   2          1702 13.046809
8   2          1703  2.311833
9   2          1704  2.270305
10  2          1705        NA

The central_value column is the "middle" month value for the sliding window.

Luke C
  • 10,081
  • 1
  • 14
  • 21