Loop of regressions on input range - How can I avoid the for loop and improve performance?

Question

I am currently backtesting a strategy which involves an lm() regression and a probit glm() regression. I have a dataframe named forBacktest with 200 rows (one for each day to backtest) and 9 columns: the first 8 (x1 to x8) are the explanatory variables and the last one (x9) is the real value (which I am trying to explain in the regression). To do the regression, I have another dataframe named temp which has about 1000 rows (one for each day) and a lot of columns, among which are the x1 to x8 values and also the x9 value.

The tricky part is that I do not just fit one regression model and then loop over predict. Instead, I split the values of x1 into 8 different ranges and, according to the value of x1 in each row of forBacktest, I fit the regression on the part of temp whose x1 lies in the same range.

So for each one of the 200 rows, I take x1 and if x1 is between 0 and 1 (for example) then I create a part of temp where all the x1 are between 0 and 1, then I fit a regression to explain x9 with x1, x2, ... x8 (just x1+x2+..., there is no x1:x2, x1^2,...) and then I use the predict function with the row of the dataframe forBacktest. If I predict a positive value and x9 is positive, then I increment a counter success by one (likewise if both are negative), but if one is positive and the other negative, then success stays the same. Then I take the next row and so on. At the end of the 200 rows, I have an average of the successes, which I return. In fact, I have two averages: one for the lm regression and the other for the glm regression (same methodology, I just take sign(x9) as the variable to explain).

So my question is: how can I do that efficiently in R, if possible without a big for loop of 200 iterations where each iteration creates a part of the dataframe, fits the regressions, predicts the two values, adds them to a counter and so on? (This is currently my solution, but I find it too slow and not very R-like.)

My code looks like this:

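backtest <- function() {
    success <- 0  # number of rows where the predicted sign matches the real one
    for (i in 1:dim(forBacktest)[1]) {
        x1 <- forBacktest[i,1]; x2 <- forBacktest[i,2] ... x9 <- forBacktest[i,9]
        # a and b bound the x1 range used for this row
        a <- ifelse(x1>1.5,1.45,ifelse(x1>1,0.95,....
        b <- ifelse(x1>1.5,100,ifelse(x1>1,1.55,....
        # keep only the rows of temp whose x1 falls in that range
        temp2 <- temp[(temp$x1>=a/100)&(temp$x1<=b/100),]
        df <- data.frame(x1=temp2$x1,x2=temp2$x2,...x9=temp2$x9)
        reg <- lm(x9~.,data=df)
        df2 <- data.frame(x1,x2,...x9)
        rReg <- predict(reg,df2)
        # 1 if the prediction and the real value have the same sign, 0 otherwise
        trueOrFalse <- ifelse(sign(rReg*x9)>0,1,0)
        success <- success+trueOrFalse
    }
    success
}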

do you want to divide a column in a dataframe by specific range and then do lm() on to the values corresponding to one of those ranges? — raiyan, Sep 11 '15 at 13:43
you understand that doing that would also remove the rows from all the other columns i.e. from x2 to x9, right? — raiyan, Sep 11 '15 at 13:47
yes I only want the regression when x1 is in a given range so only the x2...x9 values in the rows where x1 is in a given range. But I don't want to destroy the dataframe so i use temp2 — etienne, Sep 11 '15 at 13:54
the code has many changes.. i will post the edited code in a while — raiyan, Sep 11 '15 at 14:35
So your loop is to select ranges of input; is that essentially k-fold Cross-Validation? — smci, May 13 '19 at 22:13

score 1 · Accepted Answer · edited May 23 '17 at 12:22

The code you have written is more complicated than it needs to be. Things could be much simpler.

Use the cut() and the by() function.
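As a quick illustration of what cut() returns, here is a minimal sketch. The values of x1 below are made up for the example and are not taken from the question's data.

```r
# Illustrative values only; these are not from the question's data
x1 <- c(0.5, 1.2, 1.9, 7.5)
breaks <- 0:8

# cut() maps each value to the interval (lower, upper] that contains it
ranges <- cut(x1, breaks)
print(ranges)
print(table(ranges))  # number of values falling in each range
```

With the default settings the intervals are closed on the right, so a value exactly equal to the lowest break (0 here) would be returned as NA unless include.lowest = TRUE is set.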

breaks <- 0:8 # the boundaries of the ranges into which x1 is divided
divider <- cut(temp$x1, breaks) # the range that each row of temp falls into
subsetDat <- by(temp, INDICES = divider, data.frame) # this creates 8 dataframes, one per range
reg <- lapply(subsetDat, lm, formula = x9 ~ .) # regress x9 on all other columns within each range

'reg' will now contain the 8 lm objects corresponding to the 8 ranges. To predict, split forBacktest with the same breaks and use lapply() to call predict() on each model in reg with the matching part of forBacktest. It will return the predicted values for the eight ranges.
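The prediction step could look like the sketch below. It follows the comments under this answer: the models are fitted on temp, and forBacktest is split with the same breaks. The data is simulated, and the helper make_data, the names fits, grp and pred, and the assumption that temp holds only x1 to x9 are all illustrative choices, not part of the original code.

```r
set.seed(1)

# Simulated stand-ins for the question's data frames
# (assumption: only the columns x1..x9 are kept)
make_data <- function(n) {
  d <- as.data.frame(matrix(rnorm(n * 9), n, 9))
  names(d) <- paste0("x", 1:9)
  d$x1 <- runif(n, 0, 8)          # keep x1 inside the 0..8 breaks
  d$x9 <- d$x2 - d$x3 + rnorm(n)  # arbitrary relationship, for illustration only
  d
}
temp <- make_data(1000)
forBacktest <- make_data(200)

breaks <- 0:8

# Fit one model per x1 range, using temp
fits <- lapply(split(temp, cut(temp$x1, breaks)), lm, formula = x9 ~ .)

# Assign every backtest row to its range, using the same breaks
grp <- cut(forBacktest$x1, breaks)

# Predict each row with the model of its own range
pred <- numeric(nrow(forBacktest))
for (g in levels(grp)) {
  rows <- which(grp == g)
  if (length(rows) > 0) {
    pred[rows] <- predict(fits[[g]], newdata = forBacktest[rows, ])
  }
}

# Share of rows where the predicted and the real x9 have the same sign
success_rate <- mean(sign(pred) == sign(forBacktest$x9))
print(success_rate)
```

The remaining loop runs once per range (8 iterations) rather than once per row, and each model is fitted only once, which is where most of the saving comes from.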

A few things to keep in mind:

The method suggested above is simpler and easier to read. It will be faster than your for loop, because each model is fitted once per range (8 fits) instead of once per row (200 fits), but as the size of the data frame increases, it could get slower.
The by function takes a dataframe, applies the specified function (data.frame()) to each subset of the dataframe defined by INDICES, and returns a list. So new dataframes are created, and this could take up a lot of memory if the dataframe is large.
The *apply() functions are generally more concise than for loops, and can be faster. See here to learn more about them. The apply family comes in handy for this kind of operation.
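The question also mentions a probit glm() regression on sign(x9). The same split-and-fit pattern could be applied to it, as in the sketch below. The data is simulated with fewer columns than in the question, and the names up, groups, probits and new_row are hypothetical. Since glm() with a binomial family expects a 0/1 response, sign(x9) is recoded here as an indicator of x9 > 0, which is an assumption about what the asker intends.

```r
set.seed(2)
n <- 1000

# Simulated stand-in for temp, with fewer columns than in the question
temp <- data.frame(x1 = runif(n, 0, 8), x2 = rnorm(n), x3 = rnorm(n))
temp$x9 <- temp$x2 - temp$x3 + rnorm(n)

# Binomial models need a 0/1 response, so the sign of x9 is recoded as "x9 > 0"
temp$up <- as.integer(temp$x9 > 0)

# One probit model per x1 range
groups <- split(temp, cut(temp$x1, 0:8))
probits <- lapply(groups, function(d) {
  glm(up ~ x1 + x2 + x3, family = binomial(link = "probit"), data = d)
})

# Predicted probability of a positive x9 for one hypothetical row with x1 in (2,3]
new_row <- data.frame(x1 = 2.5, x2 = 0.3, x3 = -0.1)
p_up <- predict(probits[["(2,3]"]], newdata = new_row, type = "response")
print(p_up)
```

A predicted probability above 0.5 can then be treated as a predicted positive sign and compared with the real sign of x9, in the same way as for the lm() predictions.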

answered Sep 11 '15 at 14:05

raiyan

but where would I write the x1,...x9 ? Like lm(x9~x1+x2) or lm,x9~x1+x2 because all of these return error – etienne Sep 11 '15 at 14:26
I got the error "cannot coerce class "lm" to a data.frame" – etienne Sep 11 '15 at 14:33
the code has many changes.. i will post the correct code in a while – raiyan Sep 11 '15 at 14:34
I think you swapped forBacktest and temp, because the regression should be fitted on temp. But after the change of names it works well. On the other hand, for the prediction I used `lapply(c(1:8),function(x){predict(reg[[x]],forBacktest)})` and it returns 8 sets of 200 predictions (the size of forBacktest) and it doesn't "split" forBacktest. How can I return only 200 predictions (and not 8*200), which would be the correct predictions for the value of x1 in forBacktest? – etienne Sep 14 '15 at 08:31
that's ok I just had to split forBacktest like I did for temp. – etienne Sep 14 '15 at 08:38
