-2

I'm struggling to get the desired output using ddply. I believe I am on the right track but I think I am failing to output data from a loop, inside a loop...
Sample data:

Player, Career_Game, Date, ERA, Pitches

Gio Gonzalez, 176,  Aug 1,  3.0,    86

Gio Gonzalez, 177,  Aug 5,  4.01,   89

Gio Gonzalez, 178,  Aug 10, 4,  11

Gio Gonzalez, 179,  Aug 16, 4.06,   102

Gio Gonzalez, 180,  Aug 21, 3.83,   97

...............

Jordan Zimmermann,  114,    Apr 4,  1.8,    81

Jordan Zimmermann,  115,    Apr 9,  8.1,    57

Jordan Zimmermann,  116,    Apr 14, 5.27,   93

Jordan Zimmermann,  117,    Apr 19, 3.92,   100

..............

Ill call this data frame, BB.

So what I am trying to accomplish is I want to get the average of the previous, lets say 5 games for each player at each instance... for example so far I have the code below....

Pitchers_5 = data.frame(ddply(BB, ~Player, tail, n=5, numcolwise(mean)))

This successfully calculates the previous five games for the Player (Career_Games 176 through 180). However, I would like to get this average for each observation. So for career_game 177, the code would calculate the mean for games 172 through 176, then spit out an instance for 177 having the mean of the previous 5 games then continue to instance 178, and recalculate the previous 5 games and so on... so using the data from above, once the code got to Gio Gonzalez 181st career game, it would look like this (the average of the previous 5 games)

Gio Gonzalez, 178,  Date (Not necessary),   3.78,   77

UPDATE: Metrics comment has led me to look into the zoo package's rollmean function. I have since read a few posts and answers similar to my problem however am looking for further guidance (Rolling mean (moving average) by group/id with dplyr). This link resolves a very similar problem to mine except in 2 areas. It calculates the rolling mean of blood pressure by a unique ID into a new field, where I want to calculate the rolling mean of many fields. It also includes the blood pressure observation its on into its mean calculation. For example, Im looking for....
If I were to calculate the rolling means of Gio Gonzalez 180th game, I would want the mean of games 175 though 179. Not including the 180th game results.

Thanks!

Community
  • 1
  • 1
boothtp
  • 311
  • 4
  • 14
  • You can see `rollmean` from zoo package. – Metrics Mar 17 '15 at 02:59
  • Thanks metrics, rollmean has the functionality that I am looking for and led me to this post which is very similar to mine.(http://stackoverflow.com/questions/26198551/rolling-mean-moving-average-by-group-id-with-dplyr) However I still am looking for help, In the link posted above, the individual is looking to add the rolling mean of blood pressure by ID into a new field. In my example, I am trying to capture the same thing but I want the rolling mean applied to all numeric fields in my data frame. – boothtp Mar 17 '15 at 13:02
  • You are more likely to get an answer if you add an example data set, and desired outcome for that dataset. – ExperimenteR Mar 17 '15 at 13:27

1 Answers1

1

Assuming you want the rolling mean of ERA and Pitches and using 3 instead of 5 for illustration due to the size of the sample data set:

library(plyr)
library(zoo)

cbind(BB, ddply(BB, ~ Player, 
  function(x) rollapply(x[c("ERA", "Pitches")], list(-(1:3)), mean, fill = NA)))[-6]

giving:

             Player Career_Game   Date  ERA Pitches    ERA.1 Pitches.1
1      Gio Gonzalez         176  Aug 1 3.00      86       NA        NA
2      Gio Gonzalez         177  Aug 5 4.01      89       NA        NA
3      Gio Gonzalez         178 Aug 10 4.00      11       NA        NA
4      Gio Gonzalez         179 Aug 16 4.06     102 3.670000  62.00000
5      Gio Gonzalez         180 Aug 21 3.83      97 4.023333  67.33333
6 Jordan Zimmermann         114  Apr 4 1.80      81       NA        NA
7 Jordan Zimmermann         115  Apr 9 8.10      57       NA        NA
8 Jordan Zimmermann         116 Apr 14 5.27      93       NA        NA
9 Jordan Zimmermann         117 Apr 19 3.92     100 5.056667  77.00000

If its possible that some groups have less than 4 rows then use this. If there is one row it returns NAs. If there are less than 4 rows it reduces k so that it still returns something.

f <- function(x) {
    x <- as.matrix(x[c("ERA", "Pitches")])
    k <- min(3, nrow(x)-1)
    if (k) rollapply(x, list(-(1:k)), mean, fill = NA) else NA * x
}

cbind(BB, ddply(BB, ~ Player, f))[-6]

Note: We used this input:

Lines <- "Player, Career_Game, Date, ERA, Pitches
Gio Gonzalez, 176,  Aug 1,  3.0,    86
Gio Gonzalez, 177,  Aug 5,  4.01,   89
Gio Gonzalez, 178,  Aug 10, 4,  11
Gio Gonzalez, 179,  Aug 16, 4.06,   102
Gio Gonzalez, 180,  Aug 21, 3.83,   97
Jordan Zimmermann,  114,    Apr 4,  1.8,    81
Jordan Zimmermann,  115,    Apr 9,  8.1,    57
Jordan Zimmermann,  116,    Apr 14, 5.27,   93
Jordan Zimmermann,  117,    Apr 19, 3.92,   100"

BB <- read.csv(text = Lines, strip.white = TRUE, as.is = TRUE)

Updated to use plyr as requested. Also added variation which handles small groups.

G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • Thanks Grothendieck! This is very close to what I am looking for. The only addition would be to not include the current values into the rolling mean function (Might be confusing). Let me explain, so if you look at Gio Gonzalez 180th career game ERA rolling mean output that you generated (3.963333), this includes career_game 178, 179 and 180 ((4.00 + 4.06 + 3.83)/3 = 3.96333) into the rolling mean calculation. However I would want career_game 177, 178 and 179 ((4.01+4.00+4.06)/3 = 4.023) to be added to ERA.1 for career_game 180. Does that make sense? – boothtp Mar 17 '15 at 17:21
  • This could be fixed in two ways... somehow get rollmeanr to ignore or exclude the career_game its current iteration is on, or just insert a row (i.e. slide) the new rollmean output down one row for all observations. Thanks for your help! – boothtp Mar 17 '15 at 17:22
  • 1
    OK. Have revised to take the mean of the values in relative positions -1, -2 and -3, i.e. excluding the current value. – G. Grothendieck Mar 17 '15 at 18:02
  • Quick Question, I have been getting an error running the above code ( Error: wrong sign in 'by' argument ). It seems to not like when we have an ID, or Player with less than 3 instances. Is it possible to add error handling to catch these or if there are missing values, or NA values (like for games 1, 2 and 3) just use the nearest previous mean or single game value? For example, in the Pitches.1 field from your output from above, instead of NAs in Gio Gonzalez career game 177, it would update with the previous games Pitch count (86) – boothtp Mar 20 '15 at 00:59
  • 1
    Have added a variation which gives NA if there is one row in a group or if there are 2 or 3 rows then for the last row it takes the mean of the prior rows. Modify `f` if you want different rules. – G. Grothendieck Mar 20 '15 at 01:44