3

I'm trying to calculate a moving average in r over a particular field BUT I need this moving average to be grouped by two or more other fields. The purpose of this new average is for predictive analysis so I need it to be trailing as well. Any variables that do not have enough values to be averaged (such as student J) would ideally give either NA or its original Score value.

I've been trying rollapply and data.table and am having no luck!

I've provided the table of data and two moving averages (AVG2 with k=2 and AVG3 with k=3) to show exactly what I'm after. The moving average is on Score and the variables to group over are school, Student and area. Please help!

   no   school  Student area    Score **AVG2**  **AVG3**
   1    I       S       A       5      NA       NA
   2    B       S       A       2      NA       NA
   3    B       S       A       7      NA       NA
   4    B       O       A       3      NA       NA
   5    B       O       B       9      NA       NA
   6    I       O       A       6      NA       NA
   7    I       O       B       3      NA       NA
   8    I       S       A       7      NA       NA
   9    I       O       A       1      NA       NA
   10   B       S       A       7      4.5      NA
   11   I       S       A       3      NA       NA
   12   I       O       A       8      3.5      NA
   13   B       S       A       3      7        5.33
   14   I       O       A       4      4.5      5
   15   B       O       A       1      NA       NA
   16   I       S       A       9      5        5
   17   B       S       A       4      5        5.67
   18   B       O       A       6      2        NA
   19   I       S       A       3      6        6.33
   20   I       O       B       8      NA       NA
   21   B       S       A       3      3.5      4.67
   22   I       O       A       4      6        4.33
   23   B       O       A       1      3.5      3.33
   24   I       S       A       9      6        5
   25   B       S       A       4      3.5      3.33
   26   B       O       A       6      3.5      2.67
   27   I       J       A       6      NA       NA

here is the code to recreate the initial table in r:

school <- c('I','B','B','B','B','I','I','I','I','B','I','I','B','I','B','I','B','B','I','I','B','I','B','I','B','B','I')
Student <- c('S','S','S','O','O','O','O','S','O','S','S','O','S','O','O','S','S','O','S','O','S','O','O','S','S','O','J')
area <- c('A','A','A','A','B','A','B','A','A','A','A','A','A','A','A','A','A','A','A','B','A','A','A','A','A','A','A')
Score <- c(5,2,7,3,9,6,3,7,1,7,3,8,3,4,1,9,4,6,3,8,3,4,1,9,4,6,6)
data.frame(school, Student, area,  Score)
Bulat
  • 6,869
  • 1
  • 29
  • 52
greeny
  • 425
  • 1
  • 6
  • 20
  • Can you provide a sample R code, of how you have calculated AVG2 And AVG3? – Kumar Manglam Feb 07 '16 at 02:39
  • First, instead of providing a picture of your data please provide your data in a format that we can easily read into R, e.g. posting the output of `dput(dat)` or at least `head(dat)` in your question. Second, please share the desired output of your grouped moving average for this data. Third, please share the code you've tried so far. – josliber Feb 07 '16 at 02:49
  • I used excel to manually calculate AVG2 and AVG3 since I've had no luck with r. The desired output is the displayed AVG2 and AVG3 – greeny Feb 07 '16 at 07:50

3 Answers3

3

You can try solving the problem using dplyr and TTR but for student J from school I it is not possible to calculate a moving average as there's only one measurement.

AVG2 caluculated with stats:filter gives the result you wanted to have, but I also added AVG2b calculated with TTR::SMA to show a simple moving average calculation, where the current measurement is also taken into account.

library(dplyr)
library(TTR)

df <- data.frame(school, Student, Score)
df$AVG2 <- NA
df$AVG2b <- NA
df[!(df$school=="I" & df$Student=="J"),] <- df[!(df$school=="I" & df$Student=="J"),] %>% 
  group_by(school, Student) %>% 
  mutate(AVG2 = stats::filter(Score, c(0, 0.5, 0.5), sides = 1 ), AVG2b = SMA(Score, n= 2)) 

    > df
   school Student Score AVG2 AVG2b
1       I       S     5   NA    NA
2       B       S     2   NA    NA
3       B       S     7   NA   4.5
4       B       O     3   NA    NA
5       B       O     9   NA   6.0
6       I       O     6   NA    NA
7       I       O     3   NA   4.5
8       I       S     7   NA   6.0
9       I       O     1  4.5   2.0
10      B       S     7  4.5   7.0
...
kath
  • 7,624
  • 17
  • 32
  • My mistake, the manually calculated values were incorrect. I've amended them, do your results match now? – greeny Feb 07 '16 at 12:34
  • No, but you can also check yourself, or is the code not working for you? If your are using a simple moving average I think the value for the third row for AVG2 should be 4.5 and for the 5th row 6.0 and so on – kath Feb 07 '16 at 12:39
  • Im not sure this gives the trailing moving average that I'm after, the first 8 rows should be NA since there hasn't been a third value for both groups until row 9 (school I and student O). – greeny Feb 07 '16 at 12:54
  • I think now I understood how you are calculating the AVG2: For the third measurement of a group you take the previous 2 and take the mean from them. Moving average takes also the current measurement into account, if you don't want this you can use `stats::filter(Score, c(0, 0.5, 0.5), sides = 1)` instead of the SMA function... – kath Feb 07 '16 at 17:27
  • Is there any way I can do this and get r to just output student J's actual score since he only has 1 value. Maybe so that the first two values of any two group combo outputs as the actual score until the third appears and the averaging is applied? Appreciate your help so far. – greeny Feb 07 '16 at 20:50
  • What is the code to do this for a moving average other than 2? A moving window of 3 or 4 for example? – greeny Feb 08 '16 at 10:17
  • I edited my answer a little bit so that all the original information is kept in `df` and then you can set the value for the group which is too small to the score (subsetting and replace). For a moving average with a larger window, you should modify the filter in `stats::filter` accordingly, see `?stats::filter` for more help. – kath Feb 08 '16 at 23:18
  • Thanks for that. Do you have a more generic way to handle this when you are averaging over a group that is too small? The dataset I gave was a subset of a much large dataset that will have multiple grouping combinations that are too small. If there was a way to avoid the "Error: 'filter' is longer than time series" and just have 'NA's it would make my life so much easier!! – greeny Feb 09 '16 at 04:33
2

Here is a rollapply solution. Note that it appears that you want the average of the prior two or three rows in the same group, i.e. excluding the data on the current row.

library(zoo)

roll <- function(x, n) { 
   if (length(x) <= n) NA 
   else rollapply(x, list(-seq(n)), mean, fill = NA)
}
transform(DF, AVG2 = ave(Score, school, Student, FUN = function(x) roll(x, 2)),
              AVG3 = ave(Score, school, Student, FUN = function(x) roll(x, 3)))

giving:

   school Student Score AVG2     AVG3
1       I       S     5   NA       NA
2       B       S     2   NA       NA
3       B       S     7   NA       NA
4       B       O     3   NA       NA
5       B       O     9   NA       NA
6       I       O     6   NA       NA
7       I       O     3   NA       NA
8       I       S     7   NA       NA
9       I       O     1  4.5       NA
10      B       S     7  4.5       NA
11      I       S     3  6.0       NA
12      I       O     8  2.0 3.333333
13      B       S     3  7.0 5.333333
14      I       O     4  4.5 4.000000
15      B       O     1  6.0       NA
16      I       S     9  5.0 5.000000
17      B       S     4  5.0 5.666667
18      B       O     6  5.0 4.333333
19      I       S     3  6.0 6.333333
20      I       O     8  6.0 4.333333
21      B       S     3  3.5 4.666667
22      I       O     4  6.0 6.666667
23      B       O     1  3.5 5.333333
24      I       S     9  6.0 5.000000
25      B       S     4  3.5 3.333333
26      B       O     6  3.5 2.666667
27      I       J     6   NA       NA

Update: Fixed roll.

G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • How can I change this if I have another column to group by? I tried this method on a different dataset with more grouping fields and was getting the error: Error in seq.default(start.at, NROW(data), by = by) : wrong sign in 'by' argument – greeny Feb 09 '16 at 20:47
  • Say there is a third grouping field, 'area' for the above example. So the moving average will now need to be grouped over school, Student and also area. How would the code change to account for 3 groups rather than 2? – greeny Feb 09 '16 at 23:41
  • Thanks for the response. Unfortunately when I do as you suggest, I get the following error: Error in seq.default(start.at, NROW(data), by = by) : wrong sign in 'by' argument – greeny Feb 10 '16 at 07:20
  • I've edited the initial question to now include the extra 'area' field to be grouped over alongside new values for AVG2 and AVG3. If you could show me how you can do it I'd very much appreciate it! – greeny Feb 10 '16 at 12:39
  • Try: AVG2 = ave(Score, school, Student, area, FUN = function(x) roll(x, 2)). This is erroring for me now. n = 2 doesn't work, n = 3 does, n = 4 does, n = 5 doesn't – greeny Feb 10 '16 at 12:54
  • But your AVG2 isn't giving the same result as what I need. It has n = 3, rather than n = 2. Your AVG2 and AVG3 are exactly the same – greeny Feb 10 '16 at 13:00
  • OK. Have fixed `roll`. < should have been <= . – G. Grothendieck Feb 10 '16 at 13:06
  • Thank you so much! You are a hero – greeny Feb 10 '16 at 13:09
1

Here is AVG2 calculation with data.table, which is faster compared to other approaches:

library(data.table)
dt <- data.table(df)
setkey(dt, school, Student, area)
dt[, c("start", "len") := .(ifelse(.I + 1 > .I[.N], 0, .I +1), pmax(pmin(1, .I[.N] - .I -1), 0)), by = .(school, Student, area)][
    , AVG2 := mean(dt$Score[start:(start+len)]), by = 1:nrow(dt)]
res$AVG2[res$len == 0] <- NA
Bulat
  • 6,869
  • 1
  • 29
  • 52