I am backtesting a strategy that involves an lm() regression and a probit glm() regression. I have a data frame named forBacktest with 200 rows (one for each day to backtest) and 9 columns: the first 8 (x1 to x8) are the explanatory variables and the last one (x9) is the realised value that I am trying to explain. To fit the regressions, I have another data frame named temp with about 1000 rows (one for each day) and many columns, including x1 to x8 and x9.

The tricky part is that I do not fit a single model and then loop over predict(). Instead, I split the values of x1 into 8 ranges and, depending on the x1 value of each row of forBacktest, fit the regression on the subset of temp whose x1 falls in the same range.

So for each of the 200 rows, I take x1 and, if it is between 0 and 1 (for example), I build the subset of temp where x1 is between 0 and 1. I then regress x9 on x1, x2, ... x8 (main effects only: there is no x1:x2, x1^2, ...) and call predict() on that row of forBacktest. If the predicted value and x9 have the same sign (both positive or both negative), I increment a counter success by one; if their signs differ, success stays the same. Then I move on to the next row. After the 200 rows, I return the average of the successes. In fact I have two averages: one for the lm regression and one for the glm regression (same methodology, except that the response is sign(x9)).

So my question is: how can I do this efficiently in R, ideally without a big for loop of 200 iterations in which each iteration builds a subset of the data frame, fits the regressions, predicts the two values, adds them to a counter, and so on? (This is my current solution, but I find it too slow and not very R-like.)

My code looks like this:

backtest <- function() {
    success <- 0  # number of rows whose sign is predicted correctly
    for (i in 1:nrow(forBacktest)) {
        x1 <- forBacktest[i,1]; x2 <- forBacktest[i,2] ... x9 <- forBacktest[i,9]
        # lower and upper bounds of the x1 range this row falls into
        a <- ifelse(x1>1.5,1.45,ifelse(x1>1,0.95,.... 
        b <- ifelse(x1>1.5,100,ifelse(x1>1,1.55,....
        # rows of temp whose x1 lies in that range
        temp2 <- temp[(temp$x1>=a/100)&(temp$x1<=b/100),]
        df <- data.frame(x1=temp2$x1,x2=temp2$x2,...x9=temp2$x9)
        reg <- lm(x9~.,data=df)
        df2 <- data.frame(x1,x2,...x9)
        rReg <- predict(reg,df2)
        # 1 if prediction and realised value have the same sign, else 0
        trueOrFalse <- ifelse(sign(rReg*x9)>0,1,0)
        success <- success+trueOrFalse
    }
    success
}
etienne
  • do you want to divide a column in a dataframe by specific range and then do lm() on to the values corresponding to one of those ranges? – raiyan Sep 11 '15 at 13:43
  • yes that would be it – etienne Sep 11 '15 at 13:46
  • you understand that doing that would also remove the rows from all the other columns i.e. from x2 to x9, right? – raiyan Sep 11 '15 at 13:47
  • yes I only want the regression when x1 is in a given range so only the x2...x9 values in the rows where x1 is in a given range. But I don't want to destroy the dataframe so i use temp2 – etienne Sep 11 '15 at 13:54
  • the code has many changes.. i will post the edited code in a while – raiyan Sep 11 '15 at 14:35
  • So your loop is to select ranges of input; is that essentially k-fold Cross-Validation? – smci May 13 '19 at 22:13

1 Answer

Your code is more complicated than it needs to be. This can be made much simpler.

Use the cut() and the by() function.

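breaks <- 0:8 # boundaries of the ranges by which you want to divide your data
divider <- cut(temp$x1, breaks) # range label for each row of temp
subsetDat <- by(temp, INDICES = divider, data.frame) # this creates 8 data frames
# x9 ~ . uses every other column, so keep only x1..x9 in temp beforehand
reg <- lapply(subsetDat, lm, formula = x9 ~ .) 

'reg' now contains the 8 lm objects corresponding to the 8 ranges. The models are fitted on temp, since that is the data frame the regression should be made on. To predict, split forBacktest with the same breaks and use lapply() or Map() to apply each model to its matching subset; this returns the predicted values for the eight ranges.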
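The whole backtest for the lm case could be sketched roughly as follows. This is only an illustration under stated assumptions: the simulated data, the helper make_data(), the breaks 0:8, and the names fitGroup, btGroup, pred and successRate are all hypothetical and stand in for the real temp and forBacktest. It uses split() rather than by(), which gives the same list of data frames here.

```r
set.seed(1)

# Hypothetical stand-ins for the question's data: x1..x8 explanatory, x9 response.
make_data <- function(n) {
  d <- as.data.frame(matrix(rnorm(n * 8), n, 8))
  names(d) <- paste0("x", 1:8)
  d$x1 <- runif(n, 0, 8)                      # x1 spread over the ranges (0,1], ..., (7,8]
  d$x9 <- 0.5 * d$x2 - 0.3 * d$x3 + rnorm(n)  # made-up relationship, for illustration only
  d
}
temp        <- make_data(1000)  # history used to fit the models
forBacktest <- make_data(200)   # days to backtest

breaks <- 0:8
vars   <- paste0("x", 1:9)

# Fit one lm per x1 range, using temp only and only the columns x1..x9.
fitGroup <- cut(temp$x1, breaks)
reg <- lapply(split(temp[vars], fitGroup), function(d) lm(x9 ~ ., data = d))

# Label each backtest row with the same breaks, then predict every row
# with the model of its own range (8 iterations instead of 200).
btGroup <- cut(forBacktest$x1, breaks)
pred <- numeric(nrow(forBacktest))
for (g in levels(btGroup)) {
  idx <- which(btGroup == g)
  if (length(idx) > 0) pred[idx] <- predict(reg[[g]], forBacktest[idx, ])
}

# Share of days on which the predicted sign matches the realised sign of x9.
successRate <- mean(sign(pred) == sign(forBacktest$x9))
```

Because both data frames are cut with the same breaks, the list names of reg line up with the levels of btGroup, which is what yields 200 predictions rather than 8*200.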

A few things to keep in mind:

  • The method suggested above is simpler and easier to read. It should also be faster than your loop, mainly because it fits 8 models once instead of refitting a model for each of the 200 rows, but it can still get slow as the data frame grows.
  • The by() function takes a data frame, applies the specified function (here data.frame()) to each subset defined by INDICES, and returns a list. New data frames are therefore created, which can use a lot of memory if the data frame is large.
  • The *apply() functions are generally more concise than explicit for loops, and often faster. See here to know more about them. The apply family comes in handy for this kind of operation.
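The question also mentions a probit glm() on the sign of x9. For a single range, that step might look like the sketch below. Everything here is an assumption for illustration: the simulated data, the train/test split, the 0/1 column up, and the 0.5 cut-off. glm() with a binomial family expects a 0/1 response, so the sketch models "x9 is positive" rather than sign(x9) itself.

```r
set.seed(2)

# Hypothetical data for one x1 range: x1..x8 explanatory, x9 response.
n <- 300
d <- as.data.frame(matrix(rnorm(n * 8), n, 8))
names(d) <- paste0("x", 1:8)
d$x9 <- 0.8 * d$x2 - 0.5 * d$x3 + rnorm(n)  # made-up relationship

train <- d[1:250, ]    # plays the role of the matching subset of temp
test  <- d[251:300, ]  # plays the role of the matching rows of forBacktest

# 0/1 response: 1 when x9 is positive, 0 otherwise.
xs <- paste0("x", 1:8)
train$up <- as.integer(train$x9 > 0)
fit <- glm(up ~ ., data = train[c(xs, "up")],
           family = binomial(link = "probit"))

# Predicted probability that x9 > 0; count a success when the implied
# direction matches the realised one.
pUp <- predict(fit, newdata = test, type = "response")
probitSuccess <- mean((pUp > 0.5) == (test$x9 > 0))
```

The same per-range pattern as for lm() applies: fit one such model per subset of temp, then predict each subset of forBacktest with its own model.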
raiyan
  • but where would I write the x1,...x9 ? Like lm(x9~x1+x2) or lm,x9~x1+x2 because all of these return error – etienne Sep 11 '15 at 14:26
  • I got the error "cannot coerce class "lm" to a data.frame" – etienne Sep 11 '15 at 14:33
  • the code has many changes.. i will post the correct code in a while – raiyan Sep 11 '15 at 14:34
  • I think you swapped forBacktest and temp, because the regression should be fitted on temp. After changing the names it works well. For the prediction, however, I used `lapply(c(1:8),function(x){predict(reg[[x]],forBacktest)})` and it returns 8 sets of 200 predictions (the size of forBacktest); it doesn't "split" forBacktest. How can I return only 200 predictions (and not 8*200), each being the correct prediction for that row's x1 value in forBacktest? – etienne Sep 14 '15 at 08:31
  • that's ok I just had to split forBacktest like I did for temp. – etienne Sep 14 '15 at 08:38