
Let's take the following data:

dt <- data.table(TICKER=c(rep("ABC",10),"DEF"),
        PERIOD=c(rep(as.Date("2010-12-31"),10),as.Date("2011-12-31")),
        DATE=as.Date(c("2010-01-05","2010-01-07","2010-01-08","2010-01-09","2010-01-10","2010-01-11","2010-01-13","2010-04-01","2010-04-02","2010-08-03","2011-02-05")),
        ID=c(1,2,1,3,1,2,1,1,2,2,1),VALUE=c(1.5,1.3,1.4,1.6,1.4,1.2,1.5,1.7,1.8,1.7,2.3))
setkey(dt,TICKER,PERIOD,ID,DATE)

Now for each ticker/period combination, I need the following in a new column:

  • PRIORAVG: The mean of the latest VALUE of each ID, excluding the current ID, provided it is no more than 180 days old.
  • PREV: The previous value from the same ID.

The result should look like this:

      TICKER     PERIOD       DATE ID VALUE PRIORAVG PREV
 [1,]    ABC 2010-12-31 2010-01-05  1   1.5       NA   NA
 [2,]    ABC 2010-12-31 2010-01-08  1   1.4     1.30  1.5
 [3,]    ABC 2010-12-31 2010-01-10  1   1.4     1.45  1.4
 [4,]    ABC 2010-12-31 2010-01-13  1   1.5     1.40  1.4
 [5,]    ABC 2010-12-31 2010-04-01  1   1.7     1.40  1.5
 [6,]    ABC 2010-12-31 2010-01-07  2   1.3     1.50   NA
 [7,]    ABC 2010-12-31 2010-01-11  2   1.2     1.50  1.3
 [8,]    ABC 2010-12-31 2010-04-02  2   1.8     1.65  1.2
 [9,]    ABC 2010-12-31 2010-08-03  2   1.7     1.70  1.8
[10,]    ABC 2010-12-31 2010-01-09  3   1.6     1.35   NA
[11,]    DEF 2011-12-31 2011-02-05  1   2.3       NA   NA

Note that the PRIORAVG on row 9 equals 1.7, which is the VALUE on row 5: the only prior observation by another ID within the past 180 days.

I have discovered the data.table package, but I can't seem to fully understand the := function. When I keep it simple, it seems to work. To obtain the previous value for each ID (based on the solution to this question):

dt[,PREV:=dt[J(TICKER,PERIOD,ID,DATE-1),roll=TRUE,mult="last"][,VALUE]]

This works great, and it only takes 0.13 seconds to perform this operation over my dataset with ~250k rows; my vector scan function gets identical results but is about 30,000 times slower.
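The DATE-1 trick relies on how rolling joins behave. A minimal sketch of this (hypothetical table and column names, not the data above): with roll=TRUE the join date falls back to the most recent earlier observation.

```r
library(data.table)

# Hypothetical toy table, keyed on id then date
prices <- data.table(id   = c(1L, 1L),
                     date = as.Date(c("2010-01-05", "2010-01-08")),
                     v    = c(1.5, 1.4))
setkey(prices, id, date)

# roll = TRUE rolls the requested date back to the last available
# observation, so querying 2010-01-07 picks up the 2010-01-05 row
prices[J(1L, as.Date("2010-01-07")), v, roll = TRUE]  # returns 1.5
```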

Ok, so I've got my first requirement. Let's get to the second, more complex requirement. Right now the fastest method so far for me is to use a couple of vector scans and throw the function through the plyr function adply to get the result for each row.

library(plyr)

calc <- function(df, ticker, period, id, date) {
  # other IDs' observations in the same ticker/period, within the last 180 days
  df <- df[df$TICKER == ticker & df$PERIOD == period
           & df$ID != id & df$DATE < date & df$DATE > date - 180, ]
  df <- df[order(df$DATE), ]
  # mean of the most recent VALUE per remaining ID
  mean(df[!duplicated(df$ID, fromLast = TRUE), "VALUE"])
}

df <- data.frame(dt)
adply(df, 1, function(x) calc(df, x$TICKER, x$PERIOD, x$ID, x$DATE))

I wrote the function for a data.frame, and it does not seem to work with a data.table. For a subset of 5,000 rows this takes about 44 seconds, but my data consists of over 1 million rows. I wonder whether this can be made more efficient through the use of :=.

dt[J("ABC"),last(VALUE),by=ID][,mean(V1)]

This works to select the average of the latest VALUEs for each ID for ABC.

dt[,PRIORAVG:=dt[J(TICKER,PERIOD),last(VALUE),by=ID][,mean(V1)]]

This, however, does not work as expected: it takes the average of all last VALUEs across all ticker/periods instead of only the current ticker/period, so every row ends up with the same mean value. Am I doing something wrong, or is this a limitation of :=?
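The symptom is easy to reproduce on a toy table (hypothetical data, unrelated to dt above): without by=, the j expression is evaluated just once over the whole table and its single result is recycled into every row, whereas adding by= evaluates it once per group.

```r
library(data.table)

# Hypothetical toy data
toy <- data.table(g = c("a", "a", "b"), v = c(1, 2, 10))

toy[, m_all := mean(v)]          # j evaluated once: every row gets mean(c(1, 2, 10))
toy[, m_grp := mean(v), by = g]  # j evaluated per group: 1.5, 1.5, 10
```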

  • Hints: join inherited scope for the prevailing observation within the last 180 days (using the `i.` prefix: `[,j=list(...,age=PERIOD-i.PERIOD,...),][age<180]`), and `mult="last"` rather than `last()`, maybe. – Matt Dowle May 22 '12 at 14:32
  • The data panel in question looks to be keyed differently from the code extract above it. And it seems to be missing a `)`. – Matt Dowle May 22 '12 at 14:35
  • added data that shows the expected results of the 180 days requirement – Dirk May 23 '12 at 11:34

2 Answers


Great question. Try this:

dt
     TICKER     PERIOD       DATE ID VALUE
[1,]    ABC 2010-12-31 2010-01-05  1   1.5
[2,]    ABC 2010-12-31 2010-01-08  1   1.4
[3,]    ABC 2010-12-31 2010-01-10  1   1.4
[4,]    ABC 2010-12-31 2010-01-13  1   1.5
[5,]    ABC 2010-12-31 2010-01-07  2   1.3
[6,]    ABC 2010-12-31 2010-01-11  2   1.2
[7,]    ABC 2010-12-31 2010-01-09  3   1.6
[8,]    DEF 2011-12-31 2011-02-05  1   2.3

ids = unique(dt$ID)
dt[,PRIORAVG:=NA_real_]
for (i in 1:nrow(dt))
    dt[i,PRIORAVG:=dt[J(TICKER[i],PERIOD[i],setdiff(ids,ID[i]),DATE[i]),
                      mean(VALUE,na.rm=TRUE),roll=TRUE,mult="last"]]
dt
     TICKER     PERIOD       DATE ID VALUE PRIORAVG
[1,]    ABC 2010-12-31 2010-01-05  1   1.5       NA
[2,]    ABC 2010-12-31 2010-01-08  1   1.4     1.30
[3,]    ABC 2010-12-31 2010-01-10  1   1.4     1.45
[4,]    ABC 2010-12-31 2010-01-13  1   1.5     1.40
[5,]    ABC 2010-12-31 2010-01-07  2   1.3     1.50
[6,]    ABC 2010-12-31 2010-01-11  2   1.2     1.50
[7,]    ABC 2010-12-31 2010-01-09  3   1.6     1.35
[8,]    DEF 2011-12-31 2011-02-05  1   2.3       NA

Then, what you had already, with a slight simplification:

dt[,PREV:=dt[J(TICKER,PERIOD,ID,DATE-1),VALUE,roll=TRUE,mult="last"]]

     TICKER     PERIOD       DATE ID VALUE PRIORAVG PREV
[1,]    ABC 2010-12-31 2010-01-05  1   1.5       NA   NA
[2,]    ABC 2010-12-31 2010-01-08  1   1.4     1.30  1.5
[3,]    ABC 2010-12-31 2010-01-10  1   1.4     1.45  1.4
[4,]    ABC 2010-12-31 2010-01-13  1   1.5     1.40  1.4
[5,]    ABC 2010-12-31 2010-01-07  2   1.3     1.50   NA
[6,]    ABC 2010-12-31 2010-01-11  2   1.2     1.50  1.3
[7,]    ABC 2010-12-31 2010-01-09  3   1.6     1.35   NA
[8,]    DEF 2011-12-31 2011-02-05  1   2.3       NA   NA

If this is ok as a prototype, then a large speed improvement would be to keep the loop but use set() instead of :=, to reduce overhead:

for (i in 1:nrow(dt))
    set(dt,i,6L,dt[J(TICKER[i],PERIOD[i],setdiff(ids,ID[i]),DATE[i]),
                   mean(VALUE,na.rm=TRUE),roll=TRUE,mult="last"])
dt
     TICKER     PERIOD       DATE ID VALUE PRIORAVG PREV
[1,]    ABC 2010-12-31 2010-01-05  1   1.5       NA   NA
[2,]    ABC 2010-12-31 2010-01-08  1   1.4     1.30  1.5
[3,]    ABC 2010-12-31 2010-01-10  1   1.4     1.45  1.4
[4,]    ABC 2010-12-31 2010-01-13  1   1.5     1.40  1.4
[5,]    ABC 2010-12-31 2010-01-07  2   1.3     1.50   NA
[6,]    ABC 2010-12-31 2010-01-11  2   1.2     1.50  1.3
[7,]    ABC 2010-12-31 2010-01-09  3   1.6     1.35   NA
[8,]    DEF 2011-12-31 2011-02-05  1   2.3       NA   NA

That should be a lot faster than the repeated vector scans shown in the question.

Or, the operation could be vectorized, but that would be less easy to write and read, given the nature of this task.

Btw, there isn't any data in the question that would exercise the 180-day requirement. If you add some and show the expected output again, I'll add the age calculation using join inherited scope, as mentioned in the comments.

Matt Dowle
  • Great answer. It only takes 20 minutes to calculate the first part of my dataset (180k rows) vs. several hours for the vector method. I like the use of setdiff() to select all but the current ID, but I think it might slow things down a bit with a large number of IDs (there are 6000 in my dataset and only an average of 16 IDs per ticker). – Dirk May 23 '12 at 09:58
  • Good. 20 minutes still sounds very long for this task. Using `set()`? Anyway, as the mantra goes, `Rprof`, `Rprof`, `Rprof`. Yes on `setdiff()` (if `Rprof` does show that is causing the time), you could do that upfront and store a list or environment of the "other" ids for each id and then just look it up. Or there might be a simpler way I'm missing. – Matt Dowle May 23 '12 at 10:28
  • That is indeed with using `set()`. `setdiff()` itself is not taking much time, it is the subsetting using the output of `setdiff()` that does. Testing with a subset of 5k rows, increasing `ids` from 738 to 5866 adds 60% calculation time. – Dirk May 23 '12 at 11:21
  • Ok, how about applying [data.table wiki point 3](http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table)? For that to work you'll need to change `nomatch=0` to take out the `NA` because `.Internal(mean,na.rm=TRUE)` ignores the `na.rm` argument. This will be done automatically in v1.8.1. – Matt Dowle May 23 '12 at 11:36
  • And may have to do away with the loop then, to get a binary merge rather than a full binary search for every row. – Matt Dowle May 23 '12 at 11:38
  • You'd do that by removing all the `[i]`s and adding a `by=list(...)`, maybe. Perhaps have a go at that yourself. It sure is an interesting one. – Matt Dowle May 23 '12 at 11:45
  • By the way, taking the output from Rprof, the biggest change when increasing the size of `ids` comes from .Call: increase in self.pct from 5.71 to 15.48. – Dirk May 23 '12 at 12:02

Another possible approach using later versions of data.table:

library(data.table) #data.table_1.12.6 as of Nov 20, 2019
cols <- copy(names(DT))
DT[, c("MIN_DATE", "MAX_DATE") := .(DATE - 180L, DATE)]

DT[, PRIORAVG := 
        .SD[.SD, on=.(TICKER, PERIOD, DATE>=MIN_DATE, DATE<=MAX_DATE),
            by=.EACHI, {
                subdat <- .SD[x.ID!=i.ID]
                pavg <- if (subdat[, .N > 0L])
                    mean(subdat[, last(VALUE), ID]$V1, na.rm=TRUE)
                else 
                    NA_real_
                c(setNames(mget(paste0("i.", cols)), cols), .(PRIORAVG=pavg))
            }]$PRIORAVG
]

DT[, PREV := shift(VALUE), .(TICKER, PERIOD, ID)]

output:

    TICKER     PERIOD       DATE ID VALUE   MIN_DATE   MAX_DATE PRIORAVG PREV
 1:    ABC 2010-12-31 2010-01-05  1   1.5 2009-07-09 2010-01-05       NA   NA
 2:    ABC 2010-12-31 2010-01-08  1   1.4 2009-07-12 2010-01-08     1.30  1.5
 3:    ABC 2010-12-31 2010-01-10  1   1.4 2009-07-14 2010-01-10     1.45  1.4
 4:    ABC 2010-12-31 2010-01-13  1   1.5 2009-07-17 2010-01-13     1.40  1.4
 5:    ABC 2010-12-31 2010-04-01  1   1.7 2009-10-03 2010-04-01     1.40  1.5
 6:    ABC 2010-12-31 2010-01-07  2   1.3 2009-07-11 2010-01-07     1.50   NA
 7:    ABC 2010-12-31 2010-01-11  2   1.2 2009-07-15 2010-01-11     1.50  1.3
 8:    ABC 2010-12-31 2010-04-02  2   1.8 2009-10-04 2010-04-02     1.65  1.2
 9:    ABC 2010-12-31 2010-08-03  2   1.7 2010-02-04 2010-08-03     1.70  1.8
10:    ABC 2010-12-31 2010-01-09  3   1.6 2009-07-13 2010-01-09     1.35   NA
11:    DEF 2011-12-31 2011-02-05  1   2.3 2010-08-09 2011-02-05       NA   NA
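For reference, shift() defaults to a lag of one row filled with NA, so within each group the first row gets NA and every later row gets the previous VALUE. A toy sketch (hypothetical data, not the table above):

```r
library(data.table)

# shift() is a lag-1 with NA fill by default
toy <- data.table(ID = c(1L, 1L, 2L), VALUE = c(1.5, 1.4, 1.3))
toy[, PREV := shift(VALUE), by = ID]
toy$PREV  # NA 1.5 NA
```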
chinsoon12