3

I have soccer results data in the following format (thousands of observations):

     Div  date       value     pts
1    E0 2011-08-13   Blackburn 0.0
2    E0 2011-08-13      Fulham 0.5
3    E0 2011-08-13   Liverpool 0.5
4    E0 2011-08-13   Newcastle 0.5
5    E0 2011-08-13         QPR 0.0
6    E0 2011-08-13       Wigan 0.5
7    E0 2011-08-14       Stoke 0.5
8    E0 2011-08-14   West Brom 0.0
9    E0 2011-08-15    Man City 1.0
10   E0 2011-08-20     Arsenal 0.0
11   E0 2011-08-20 Aston Villa 1.0

plus other variables. "value" is the team, pts is the final result (win/loss/draw) as a numerical value. I'm trying to add a new variable which is the average of this value over the last X games for the team in that row. How do I do this without using some horrible loop?

user1185563
  • 33
  • 1
  • 3

3 Answers3

3

take a look at this

using the zoo package and rollmean and the plyr package's ddply:

library(zoo)
library(plyr)
dat <- data.frame(value=letters[1:5], pts=sample(c(0, 0.5, 1), 50, replace=T))
ddply(dat, .(value), summarise, rollmean(pts, k=5, align='right'))

however, as far as I understand a "rolling average" it shortens your data set by definition. you can supply a fill argument though:

ddply(dat, .(value), summarise, rollmean(pts, k=5, fill=NA, align='right'))
Community
  • 1
  • 1
Justin
  • 42,475
  • 9
  • 93
  • 111
1

Try ave function from stats.

Trt <- gl(n=2, k=3, length=2*3, labels =c("A", "B"))
Y <- 1:6
Data <- data.frame(Trt, Y)
 Data
  Trt Y
1   A 1
2   A 2
3   A 3
4   B 4
5   B 5
6   B 6
Data$TrtMean <- ave(Y, Trt, FUN=mean)
Data
  Trt Y TrtMean
1   A 1       2
2   A 2       2
3   A 3       2
4   B 4       5
5   B 5       5
6   B 6       5
MYaseen208
  • 22,666
  • 37
  • 165
  • 309
1

This can be done quite efficiently with tapply. I've altered your data somewhat by duplicating teams' games, with random scores and dates. This takes the mean of the most recent 2 games, as specified in the tail function.

# create some data
d <- structure(list(Div = structure(rep(1L, 33), .Label = " E0", 
  class = "factor"), date = structure(c(15013, 14990, 14996, 15001, 14995, 15006, 
  15020, 15032, 15023, 15022, 15015, 15016, 15034, 14994, 14986, 14998, 14982, 
  14979, 14980, 15016, 15031, 15013, 15031, 14999, 15025, 14978, 15007, 15026, 
  14992, 14997, 15023, 14986, 15028), class = "Date"), 
  value = structure(c(3L, 4L, 5L, 7L, 8L, 11L, 9L, 10L, 6L, 1L, 2L, 3L, 4L, 5L, 
  7L, 8L, 11L, 9L, 10L, 6L, 1L, 2L, 3L, 4L, 5L, 7L, 8L, 11L, 9L, 10L, 6L, 1L, 
  2L), .Label = c("Arsenal", "Aston Villa", "Blackburn", "Fulham", "Liverpool",
  "Man City", "Newcastle", "QPR", "Stoke", "West Brom", "Wigan"), 
  class = "factor"), pts = c(0.5, 0.5, 0.5, 1, 1, 1, 1, 0, 1, 0.5, 0, 1, 1, 1, 1, 
  0.5, 0.5, 0, 0.5, 0.5, 0, 0, 0, 1, 0, 0, 0.5, 0, 1, 0, 0.5, 0.5, 0.5)), 
  .Names = c("Div", "date", "value", "pts"), row.names = c(NA, 33L), 
  class = "data.frame")

# sort rows by date
d2 <- d[order(d$date),]
# mean of all games
tapply(d2$pts, d2$value, mean)
# mean of last 2 games
tapply(d2$pts, d2$value, function(x) mean(tail(x, 2)))

# To tidy up the output, you could use simplify=FALSE and do.call(rbind, x):
# e.g., mean of last 2 games:
do.call(rbind, tapply(d2$pts, d2$value, function(x) mean(tail(x, 2)), 
  simplify=F))

            [,1]
Arsenal     0.25
Aston Villa 0.25
Blackburn   0.50
Fulham      1.00
Liverpool   0.25
Man City    0.75
Newcastle   1.00
QPR         0.50
Stoke       1.00
West Brom   0.00
Wigan       0.50
jbaums
  • 27,115
  • 5
  • 79
  • 119
  • In fact, `aggregate` would do the job in one step, e.g. `aggregate(d2$pts, list(d2$value), function(x) mean(tail(x, 2)))` – jbaums Feb 05 '12 at 13:26