1

This is a follow on question from Cumulative Mean with Grouping and Lag and Grouped moving average in r.

I'm looking to create a cumulative mean field with a lag of one that groups over multiple variables but is only calculating the average on certain criteria. So for the example below, S-AVG only gives the cumulative mean for S and vice versa for O-AVG and J-AVG. I'm sure this is possible using ave and cumsum, but am unsure how to do it.

Here is the desired output:

Player  Goals   **S-AVG**    **O-AVG**    **J-AVG**
S       5                         
S       2       5
S       7       3.5         
O       3                    
O       9                     3
O       6                     6      
O       3                     3 
S       7       4.66         
O       1                     5.25
S       7       5.25         
S       3       5.6         
Q       8                     4.4            
S       3       5.16                  
O       4                     5           
P       1                     4.857
S       9       4.857         
S       4       5.375         
Z       6                     4.375        
S       3       5.22         
O       8                     4.55            
S       3       5                  
O       4                     4.9      
O       1                     4.81      
S       9       4.81                  
S       4       5.16                  
O       6                     4.5       
J       6                     

Here is the data input for r

Player <- c('S','S','S','O','O','O','O','S','O','S','S','O','S','O','O','S','S','O','S','O','S','O','O','S','S','O','J')
Goals <- c(5,2,7,3,9,6,3,7,1,7,3,8,3,4,1,9,4,6,3,8,3,4,1,9,4,6,6)
data.frame(Player, Goals)

Any help is appreciated.

Community
  • 1
  • 1
greeny
  • 425
  • 1
  • 6
  • 20
  • I think this may be more straightforward if you simply subset into 3 data frames, compute the cumulative average and then recombine. – giraffehere Feb 16 '16 at 21:18
  • cant you just reshape data to this format from previous anwser? – mtoto Feb 16 '16 at 21:23
  • Also, melting and casting may help as well. Might be good to look into. – giraffehere Feb 16 '16 at 21:34
  • I want to be able to do this without having to subset and re-merge since I will be doing this on a large scale. I've been trying this with cumroll <- function(x) { x <- head(x, -1) c(NA, cumsum(x) / seq_along(x))} and ave(Goals, (Player == "S"), FUN = cumroll)) but its not working. Surely there must be a way of doing it within the ave? – greeny Feb 17 '16 at 20:24

2 Answers2

3

Suppose DF2 is the data frame computed in my answer to the prior post referenced in the question, i.e. the data frame having the AVG column. It is also reproduced in the Note at the end of this answer.

If we only had one or a small fixed number of players we could do this by writing out an AVG.* column for each player (shown for one player):

transform(DF2, AVG.S = ifelse(Player == "S", AVG, NA))

but a more general approach follows. Set levs to the levels of the Player factor or if you don't want all players then levs should be set to a character vector of just the players you want. Then use sapply to contruct a logical matrix and convert that to a matrix of 1s and NAs which is then scalar multiplied by AVG.

The solution has a number of desirable features -- it does not overwrite its input (which would be error prone) and it avoids needless repeated qualification (both thanks to transform), it uses the whole object approach rather than loops and subscripting, it leverages existing code avoiding duplication (by using the result of the prior solution that this question is a follow-on to) and is brief -- two lines of code. It uses no packages.

(Also note that as an alternative sapply(...) could be replaced with model.matrix(~ Player + 0) in which case the column names would be slightly different.)

levs <- levels(DF2$Player)
transform(DF2, Avg = ifelse(sapply(levs, `==`, Player), 1, NA) * AVG)

giving:

   Player Goals      AVG Avg.J    Avg.O    Avg.S
1       S     5       NA    NA       NA       NA
2       S     2 5.000000    NA       NA 5.000000
3       S     7 3.500000    NA       NA 3.500000
4       O     3       NA    NA       NA       NA
5       O     9 3.000000    NA 3.000000       NA
6       O     6 6.000000    NA 6.000000       NA
7       O     3 6.000000    NA 6.000000       NA
8       S     7 4.666667    NA       NA 4.666667
9       O     1 5.250000    NA 5.250000       NA
10      S     7 5.250000    NA       NA 5.250000
11      S     3 5.600000    NA       NA 5.600000
12      O     8 4.400000    NA 4.400000       NA
13      S     3 5.166667    NA       NA 5.166667
14      O     4 5.000000    NA 5.000000       NA
15      O     1 4.857143    NA 4.857143       NA
16      S     9 4.857143    NA       NA 4.857143
17      S     4 5.375000    NA       NA 5.375000
18      O     6 4.375000    NA 4.375000       NA
19      S     3 5.222222    NA       NA 5.222222
20      O     8 4.555556    NA 4.555556       NA
21      S     3 5.000000    NA       NA 5.000000
22      O     4 4.900000    NA 4.900000       NA
23      O     1 4.818182    NA 4.818182       NA
24      S     9 4.818182    NA       NA 4.818182
25      S     4 5.166667    NA       NA 5.166667
26      O     6 4.500000    NA 4.500000       NA
27      J     6       NA    NA       NA       NA

Note: This was used as input above:

DF2 <- structure(list(Player = structure(c(3L, 3L, 3L, 2L, 2L, 2L, 2L, 
3L, 2L, 3L, 3L, 2L, 3L, 2L, 2L, 3L, 3L, 2L, 3L, 2L, 3L, 2L, 2L, 
3L, 3L, 2L, 1L), .Label = c("J", "O", "S"), class = "factor"), 
    Goals = c(5, 2, 7, 3, 9, 6, 3, 7, 1, 7, 3, 8, 3, 4, 1, 9, 
    4, 6, 3, 8, 3, 4, 1, 9, 4, 6, 6), AVG = c(NA, 5, 3.5, NA, 
    3, 6, 6, 4.66666666666667, 5.25, 5.25, 5.6, 4.4, 5.16666666666667, 
    5, 4.85714285714286, 4.85714285714286, 5.375, 4.375, 5.22222222222222, 
    4.55555555555556, 5, 4.9, 4.81818181818182, 4.81818181818182, 
    5.16666666666667, 4.5, NA)), .Names = c("Player", "Goals", 
"AVG"), row.names = c(NA, -27L), class = "data.frame")
Community
  • 1
  • 1
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • it's a nice solution, but isn't going over the previous data frame not a bit the long way around? I'd also like to point out that `transform()` has non-standard evaluation of the arguments and hence should only be used interactively. – Joris Meys Feb 18 '16 at 15:12
  • Regarding transform, that's not "personal style". That's almost literally the warning on the help page. Same for subset. They can make the code much easier to read when used interactively. If you wrap a function around it, I refer to `?transform` for the advice of the R core team. – Joris Meys Feb 18 '16 at 15:48
  • Ever wondered why `mutate` exists and why Hadley doesn't simply use `transform`, which does exactly the same? And why he provided an argument `.dots` -quote- to work around non-standard evaluation -unquote- ? You can call it opinion, but I'm not the only one having it. Actually, that opinion is one of the reasons of existence for the `dplyr` package. – Joris Meys Feb 18 '16 at 16:28
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/103871/discussion-between-joris-meys-and-g-grothendieck). – Joris Meys Feb 18 '16 at 16:45
2

Another approach is simply using indices. First make a function cummean (which is trivial...):

cummean <- function(x){
  cumsum(x) / seq_along(x)
}

Then calculate the cumulative means and store in a list (simplify = FALSE ) :

avgs <- with(mydf,
             tapply(Goals,Player,cummean,
                    simplify = FALSE))

Finally, create the variables based on the player names, conveniently added as names of the list returned by tapply. I specifically use a for loop to avoid having to rebuild the complete data frame every time. Using indices, I can fill up the data frame in a quite more efficient way and still have that lag you want. :

for(i in names(avgs)){
  theavg <- avgs[[i]]
  mydf[[i]][mydf$Player == i] <- c(NA, theavg[-length(theavg)])
}
Joris Meys
  • 106,551
  • 31
  • 221
  • 263