Let's take the following data:
dt <- data.table(TICKER=c(rep("ABC",10),"DEF"),
PERIOD=c(rep(as.Date("2010-12-31"),10),as.Date("2011-12-31")),
DATE=as.Date(c("2010-01-05","2010-01-07","2010-01-08","2010-01-09","2010-01-10","2010-01-11","2010-01-13","2010-04-01","2010-04-02","2010-08-03","2011-02-05")),
ID=c(1,2,1,3,1,2,1,1,2,2,1),VALUE=c(1.5,1.3,1.4,1.6,1.4,1.2,1.5,1.7,1.8,1.7,2.3))
setkey(dt,TICKER,PERIOD,ID,DATE)
Now for each ticker/period combination, I need the following in a new column:
PRIORAVG
: The mean of the latest VALUE of each ID, excluding the current ID, providing it is no more than 180 days old.PREV
: The previous value from the same ID.
The result should look like this:
TICKER PERIOD DATE ID VALUE PRIORAVG PREV
[1,] ABC 2010-12-31 2010-01-05 1 1.5 NA NA
[2,] ABC 2010-12-31 2010-01-08 1 1.4 1.30 1.5
[3,] ABC 2010-12-31 2010-01-10 1 1.4 1.45 1.4
[4,] ABC 2010-12-31 2010-01-13 1 1.5 1.40 1.4
[5,] ABC 2010-12-31 2010-04-01 1 1.7 1.40 1.5
[6,] ABC 2010-12-31 2010-01-07 2 1.3 1.50 NA
[7,] ABC 2010-12-31 2010-01-11 2 1.2 1.50 1.3
[8,] ABC 2010-12-31 2010-04-02 2 1.8 1.65 1.2
[9,] ABC 2010-12-31 2010-08-03 2 1.7 1.70 1.8
[10,] ABC 2010-12-31 2010-01-09 3 1.6 1.35 NA
[11,] DEF 2011-12-31 2011-02-05 1 2.3 NA NA
Note the PRIORAVG
on row 9 is equal to 1.7 (which is equal to the VALUE
on row 5, which is the only prior observation in the past 180 days by another ID
)
I have discovered the data.table
package, but I can't seem to fully understand the :=
function. When I keep it simple, it seems to work. To obtain the previous value for each ID (I based this on the solution to this question):
dt[,PREV:=dt[J(TICKER,PERIOD,ID,DATE-1),roll=TRUE,mult="last"][,VALUE]]
This works great, and it only takes 0.13 seconds to perform this operation over my dataset with ~250k rows; my vector scan function gets identical results but is about 30,000 times slower.
Ok, so I've got my first requirement. Let's get to the second, more complex requirement. Right now the fasted method so far for me is using a couple of vector scans and throwing the function through the plyr
function adply
to get the result for each row.
calc <- function(df,ticker,period,id,date) {
df <- df[df$TICKER == ticker & df$PERIOD == period
& df$ID != id & df$DATE < date & df$DATE > date-180, ]
df <- df[order(df$DATE),]
mean(df[!duplicated(df$ID, fromLast = TRUE),"VALUE"])
}
df <- data.frame(dt)
adply(df,1,function(x) calc(df,x$TICKER,x$PERIOD,x$ID,x$DATE))
I wrote the function for a data.frame
and it does not seem to work with a data.table
. For a subset of 5000 rows this takes about 44 seconds but my data consists of > 1 million rows. I wonder if this can be made more efficient through the usage of :=
.
dt[J("ABC"),last(VALUE),by=ID][,mean(V1)]
This works to select the average of the latest VALUEs for each ID for ABC.
dt[,PRIORAVG:=dt[J(TICKER,PERIOD),last(VALUE),by=ID][,mean(V1)]]
This, however, does not work as expected, as it takes the average of all last VALUEs for all ticker/periods instead of only for the current ticker/period. So it ends up with all rows getting the same mean value. Am I doing something wrong or is this a limitation of :=
?