0

I am attempting to run a rolling regression within a data.table. There are a number of questions that get at what I am trying to do, but they are generally 3+ years old and offer inelegant answers. (see: here, for example)

I am wondering if there has been any update to the data.table package that make this more intuitive/ faster?

Here is what I am trying to do. My code looks like this:

DT<-data.table(
  Date = seq(as.Date("2000/1/1"), by = "day", length.out = 1000),
  x1=rnorm(1000),
  x2=rnorm(1000),
  x3=rnorm(1000),
  y=rnorm(1000),
  country=rep(c("a","b","c","d"), each=25))

I would like to regress y on x1, x2 and x3, over a rolling 180 day window, by country, and store the coefficients by date.

Ideally the syntax would look something like this:

DT[,.(coef.x1 := coef(y~x1+x2+x3)[2] , 
coef.x2 := coef(y~x1+x2+x3)[3], 
coef(y~x1+x2+x3)[4],
by=c("country",ROLLING WINDOW)]

... but even more elegant/ avoiding the repetition if possible! :)

I have yet to get the rollapply syntax to work well for me for some reason.

Thank you!


EDIT:

Thank you @michaelchirico.

Your suggestion comes close to what I'm aiming for - and maybe its possible to modify the code to receive it but again, I am stuck.

Here is a more careful articulation of what I need. Some code:

DT<-data.table(
  Date = rep(seq(as.Date("2000/1/1"), by = "day", length.out = 10),times=3), #same dates per country

  x1=rep(rnorm(10),time=3), #x1's repeat - same per country
  x2=rep(rnorm(10), times=3),#x2's repeat - same per country
  x3=rep(rnorm(10), times=3), #x3's repeat - same per country
  y=rnorm(30), #y's do not repeat and are unique per country per day
  country=rep(c("a","b","c"), each=10))

#to calculate the coefficients by individual  country: 
a<-subset(DT,country=="a")
b<-subset(DT,country=="b")

window<-5 #declare window
coefs.a<-coef(lm(y~x1+x2+x3, data=a[1:window]))#initialize my coef variable
coefs.b<-coef(lm(y~x1+x2+x3, data=b[1:window]))#initialize my coef variable

##calculate coefficients per window

for(i in 1:(length(a$Date)-window)){
  coefs.a<-rbind(coefs.a, coef(lm(y~x1+x2+x3, data=a[(i+1):(i+window-1)])))
  coefs.b<-rbind(coefs.b, coef(lm(y~x1+x2+x3, data=b[(i+1):(i+window-1)])))
 }

The difference in this dataset versus the prior one is that the dates, and x1, x2, x3 all repeat. My y's are unique for each country.

In my actual data set I have 120 countries. I can calculate this for each country, but it is awfully slow and then I have to rejoin all of the coefficients into a single dataset for analysis of the results.

Is there a way similar to what you proposed to end up with a single data.table, with all observations?

Thanks once more!!

Community
  • 1
  • 1
jsl2
  • 43
  • 7
  • 1
    And what is `coef(y~x1+x2+x3)[1]`, exactly? Do you want y~X for days 1,...,180, then for days 2,...,181, then for 3,...,182, etc? – MichaelChirico Dec 12 '15 at 21:36
  • Exactly what I'm looking for. I'd like the first observation to be the coefficients over days 1-180 and the second observation to be the coefficients over days 2-181 etc. Then roll forward. Appreciate the help – jsl2 Dec 12 '15 at 23:25
  • Do you mean days absolutely or relatively? So days 1,... 180 _within_ each country, or days 1,...180 in the _whole_ data set? – MichaelChirico Dec 12 '15 at 23:27
  • Within each country. In my actual data the y is different for each country but the dates and Xs repeat. (I.e. the time is over just a single time period, repeating for each country). I should redo the mock data to set that up correctly. – jsl2 Dec 12 '15 at 23:41
  • I notice that each country doesn't necessarily have observations for each day... should the window be 180 days _of actual observations_ or 180 _consecutive days_ (regardless of whether there are observations)? – MichaelChirico Dec 13 '15 at 00:12
  • I'm away from my computer so will test it as soon as I'm back. Either way thanks for the help! – jsl2 Dec 13 '15 at 00:41
  • I think my code does what your code is trying to do, except that you've got a typo -- the right-hand index should be `(i+1):(i+window)` (because it's `(i+1)` to `(i+1) + (window-1)`). As it stands your window is only 4 wide in the loop. Once that is fixed our results are the same. – MichaelChirico Dec 13 '15 at 20:56

2 Answers2

0

It's still not clear exactly what you're after, but here's a shot which should be close (only minor adjustments need be made depending on the details):

I can't really speak to speed.

TT <- DT[ , uniqueN(Date), by = country][ , max(V1)]
window <- 5
#pre-declare a matrix of windows; each column represents
#one of the possible windows of days
windows <- matrix(1:TT, nrow = TT + 1, ncol = max(TT - window + 1, 1))[1:window, ]

DT[ , {
  #not all possible windows necessarily apply to each
  #  country; subset to find only the relevant windows
  windowsj <- windows[ , 1:(uniqueN(Date) - window + 1)]
  #lapply returns a list (which can be readily assigned with :=)
  lapply(1:ncol(windowsj),
         function(ii){
           #subset to relevant rows
           .SD[windowsj[ , ii],
               #regress, extract
               lm(y ~ x1 + x2 + x3)$coefficients]})},
  by = country]

Comparing the result of this to your coefs.a and coefs.b:

    country         V1          V2         V3          V4          V5          V6
 1:       a -0.8764867  0.46169717  2.6712128  2.66304537  1.18928600  0.53553900
 2:       a -1.0135961  0.03985467  0.6015446  0.61316724  0.24177034  0.86369780
 3:       a -0.1807617 -0.25767309 -2.9492897 -3.05092528 -0.04310375  0.62317993
 4:       a -0.6664342 -0.30732907 -0.3362091 -0.25776715  1.04419854  1.02294125
 5:       b  0.9548685  0.77461810 -0.5100818 -0.57726788 -0.73285223 -1.64196684
 6:       b  0.7179429  0.46107110  0.1732915  0.23262455  0.23258149  3.63679221
 7:       b  0.1639778 -0.22249382  1.4539881  0.58725270  0.54879762 -0.27115275
 8:       b  0.6192641  0.12706750  0.2671673  0.79569434  0.69031761  2.27769679
 9:       c  0.2722200  0.07279085 -0.7709578 -0.74590575 -0.15773196  0.03178821
10:       c  0.8890314  0.74213624  0.4440650  0.34939003  0.50531166  0.16550026
11:       c  0.1589915  0.20531447  0.9931054  1.25495206 -0.01543296 -0.09887655
12:       c  0.7198967  0.70536869  0.4508445  0.02028332 -0.54705588 -0.64246579

> coefs.a
        (Intercept)          x1          x2         x3
coefs.a  -0.8764867 -1.01359605 -0.18076171 -0.6664342
          0.4616972  0.03985467 -0.25767309 -0.3073291
          2.6712128  0.60154458 -2.94928969 -0.3362091
          2.6630454  0.61316724 -3.05092528 -0.2577671
          1.1892860  0.24177034 -0.04310375  1.0441985
          0.5355390  0.86369780  0.62317993  1.0229412

(i.e. it's the same, just transposed)

MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
  • This is almost but not quite what I am looking for - and perhaps there is an easy way to transform it into exactly what I need? I just edited the question again above. Thanks for help. – jsl2 Dec 13 '15 at 13:59
0

frollapply only accepts numeric vector input and output, so we have to write our own with sapply() along the row indices.

window <- 180
DT[, 
   {
     data.table(t(sapply(seq_len(.N - window + 1),
                         function(k) lm(y ~ x1 + x2 + x3, 
                                        data = .SD[k:(k + window)])$coefficients)))
   }, 
   by = country] 
##      country (Intercept)         x1          x2          x3
##   1:       a  0.10163170 0.09561343 -0.11123725 -0.06489867
##   2:       a  0.11029460 0.08927926 -0.10657563 -0.06035072
##   3:       a  0.11328084 0.08856627 -0.10521865 -0.06278259
##   4:       a  0.12348242 0.07503412 -0.10483616 -0.06638923
##   5:       a  0.13285512 0.09268086 -0.11239769 -0.04068656
##  ---                                                       
## 280:       d  0.08249204 0.06252626 -0.06965884 -0.09680134
## 281:       d  0.07864977 0.05395658 -0.06137728 -0.10774067
## 282:       d  0.07937867 0.06996970 -0.07991358 -0.11377039
## 283:       d  0.07654691 0.06546692 -0.06824516 -0.10902969
## 284:       d  0.06123857 0.08590249 -0.05117317 -0.11728684
``
Drumy
  • 450
  • 2
  • 16