R - Splitting Data, regression and applying equation to new split data set

Question

I have a large data set that has older and newer data. I created two data frames, EarlyYears with the older data and LaterYears with the new data, so they have the same columns.

What I want to do is run a regression on the EarlyYears data to determine an equation, then apply it to LaterYears to test the equation's strength. A and B are constants, Input is what I am testing (I change it for different runs of the code), and Dummy is 1 if there is no data for the input. However, I want to split both the EarlyYears and LaterYears data by quintiles of one of the variables, and apply the equation found in quintile 1 of EarlyYears to the LaterYears data that is in quintile 1, and so on for the other quintiles. I am fairly new to R, and so far I have:

Model <- data.frame(Date = rep(c("3/31/09", "3/31/11"), each = 20),
                    InputRating = rep(1:5, 8), Dummy = rep(c(rep(0, 9), 1), 4),
                    Y = rep(c(1, 3, 5, 7, 11, 13, 17, 19), 5), A = 1:40, B = 1:40 * 3 + 7)
# Row indices of the 2011 ("newer") and non-2011 ("older") observations
newer <- grep("/11", Model$Date)
older <- grep("/11", Model$Date, invert = TRUE)

LaterYears <- Model[newer, ]
EarlyYears <- Model[older, ]
newModel<-EarlyYears

DataSet.Input<-data.frame(Date = newModel$Date, InputRating = newModel$InputRating, 
Dummy = newModel$Dummy, Y = newModel$Y, A = newModel$A,B = newModel$B)
quintiles<-quantile(DataSet.Input$A,probs=c(0.2,0.4,0.6, 0.8, 1.0))
VarQuint<-findInterval(DataSet.Input$A,quintiles,rightmost.closed=TRUE)+1L

regressionData<-do.call(rbind,lapply(split(DataSet.Input,VarQuint),
FUN = function(SplitData) { 
SplitRegression<-lm(Y ~ A + B + InputRating + Dummy, data = SplitData, na.action = na.omit) 
c(coef.Intercept = coef(summary(SplitRegression))[1],
coef.A = coef(summary(SplitRegression))[2], 
coef.B = coef(summary(SplitRegression))[3],
coef.Input = coef(summary(SplitRegression))[4],
coef.Dummy= coef(summary(SplitRegression))[5])
}))

i = 0
quintiles.LY<-quantile(LaterYears$A,probs=c(0.2,0.4,0.6, 0.8, 1.0))
Quint.LY<-findInterval(LaterYears$A,quintiles.LY,rightmost.closed=TRUE)+1L

LaterYears$ExpectedValue <-apply(split(LaterYears,Quint.LY),1,
FUN = function(SplitData) {
  i=i+1
  regressionData[i,1]+regressionData[i,2]*SplitData$A +
  regressionData[i,3]*SplitData$B + regressionData[i,4]*SplitData$Input +
  regressionData[i,5]*SplitData$Dummy    
})

The first part works great and gives me the coefficients in regressionData. I want the results of applying the equation to be stored in a column of the LaterYears data set, but I get this error:

Error in apply(split(LaterYears, Quint.LY), 1, FUN = function(SplitData) { :
dim(X) must have a positive length

when running this with apply, and I get a blank result when running it with lapply, which is what I originally tried.

Any help with how to fix this would be greatly appreciated! Thanks!
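
For reference, the error message can be reproduced in isolation. split() returns a plain list of data frames, and a list has no dim attribute, which is the first thing apply() checks. The sketch below uses a small made-up data frame; toy, grp and pieces are hypothetical names, not part of the data above.

```r
# Hypothetical toy data: six rows in two groups
toy <- data.frame(A = 1:6, grp = rep(1:2, each = 3))
pieces <- split(toy, toy$grp)

class(pieces)         # a plain list of data frames
is.null(dim(pieces))  # TRUE: a list has no dimensions

# apply() needs an array-like object, so calling it on a list fails;
# tryCatch() captures the error message instead of stopping
msg <- tryCatch(apply(pieces, 1, function(d) nrow(d)),
                error = function(e) conditionMessage(e))
msg

# lapply() iterates over the list elements, one call per piece
sizes <- lapply(pieces, nrow)
```

Note also that an assignment such as i = i + 1 inside the function only changes a local copy of i, so a counter kept outside the function is not advanced between pieces.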

We don't have your DataSet.Quint, nor VarQuint. Can you make your problem reproducible? — Roman Luštrik, Feb 21 '13 at 18:22
Is this better? I put in the construct for the data frame and VarQuint, but newModel is just a large dataset. — user1775563, Feb 21 '13 at 19:18
I don't think so, I still can't reproduce your problem. Try http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — Roman Luštrik, Feb 21 '13 at 21:18
I was being dumb and thought I could just half ass getting my question in here - In fact I was hoping that I was being really dumb and it would be obvious what was wrong! Sorry to have wasted your time earlier. I ran this code and got the same problem. Thank you!! — user1775563, Feb 21 '13 at 23:50
New to R and already using `do.call`. Good job. `lapply(split(x,y),fun)` is essentially the same as `by(x,y,fun)`. That's just a matter of simplifying. With respect to the second part it looks like you're trying to recreate `?predict`. But I'm just guessing. — Brandon Bertelsen, Feb 22 '13 at 04:40
Thank you - Most of what I have learned has come from trying things on this website, finding they work, and then trying to understand them! I had never heard of predict, but that is exactly it. Thank you!! — user1775563, Feb 22 '13 at 15:20
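
The equivalence between lapply(split(x, y), fun) and by(x, y, fun) mentioned in the comments can be checked on a small example. This is only a sketch on made-up data; toy, fit_one, via_split and via_by are hypothetical names.

```r
# Hypothetical toy data: two groups, each with a roughly linear trend
toy <- data.frame(x = 1:8,
                  y = c(2.1, 3.9, 6.2, 7.8, 1.0, 1.9, 3.2, 3.9),
                  g = rep(c("a", "b"), each = 4))

# Fit a simple regression to one piece and return its coefficients
fit_one <- function(d) coef(lm(y ~ x, data = d))

via_split <- lapply(split(toy, toy$g), fit_one)
via_by    <- by(toy, toy$g, fit_one)

# Same coefficients per group; by() only wraps them in a "by" object
same_a <- all.equal(via_split[[1]], via_by[[1]])
same_b <- all.equal(via_split[[2]], via_by[[2]])
```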

Brandon Bertelsen · Accepted Answer · 2013-02-22T04:50:08.347

4

Perhaps something like this, using predict, would be better. It doesn't work very well on your example data, but it may work on the real data.

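# by() splits a data frame by a factor and applies a function to each piece;
# here it fits one model per quintile of the early data
regressionData <- by(DataSet.Input, VarQuint,
                     function(d) {
                       lm(Y ~ A + B + InputRating + Dummy, data = d)
                     })

# Quintile breakpoints of LaterYears itself. With probs = seq(0, 1, 0.2) the
# minimum is a breakpoint, so findInterval() already returns 1..5
quintiles.LY <- quantile(LaterYears$A, probs = seq(0, 1, 0.2))
Quint.LY <- findInterval(LaterYears$A, quintiles.LY, rightmost.closed = TRUE)

LaterYearsPredict <- split(LaterYears, Quint.LY)

# lapply's first argument can be any vector or list, here the quintile indices
LaterYears$ExpectedValue <- unlist(lapply(seq_along(LaterYearsPredict),
       function(x)
         predict(regressionData[[x]], LaterYearsPredict[[x]])
       ))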
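
To see what predict is doing, it may help to compare it with the hand-written equation from the question on a small example. The data below are made up, and train, new, fit, manual and auto are hypothetical names; this is a sketch, not the asker's data.

```r
# Hypothetical training data and new observations
train <- data.frame(A = c(1, 2, 3, 4, 5, 6),
                    InputRating = c(2, 1, 4, 3, 5, 2),
                    Y = c(3.1, 3.9, 8.2, 8.8, 12.1, 9.7))
new <- data.frame(A = c(7, 8), InputRating = c(3, 1))

fit <- lm(Y ~ A + InputRating, data = train)
b <- coef(fit)

# Manual version of the equation: intercept plus coefficient times column
manual <- b[["(Intercept)"]] + b[["A"]] * new$A +
  b[["InputRating"]] * new$InputRating

# predict() does the same arithmetic, matching columns by name
auto <- predict(fit, newdata = new)

same <- all.equal(unname(auto), manual)
```

Looking coefficients up by name, or letting predict do it, also avoids a pitfall of positional indexing: coef(summary(fit)) leaves out coefficients that could not be estimated (for example when one predictor is an exact linear function of another), so positions can shift.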

edited Feb 22 '13 at 04:50

answered Feb 22 '13 at 04:42

Brandon Bertelsen


Worked perfectly on the real data - Thanks Brandon!! Now, I can add predict to my list of things that R can do that I wouldn't have thought it could but make my life so much easier! – user1775563 Feb 22 '13 at 15:24
There's one potential problem with this solution: if your data is not ordered by the split vector, then when you unlist and append the expected values, some of the numbers could end up out of order. You'll have to sort the data.frames first, I believe (unless they are already ordered in some fashion). – Brandon Bertelsen Feb 22 '13 at 20:10
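
One possible way around the ordering issue raised in the comment above, without sorting first, is base R's unsplit(), which reverses split() and puts per-group results back in the original row order. The sketch below uses made-up data; early, later, q, fits and preds are hypothetical names.

```r
# Hypothetical data: the group labels in `later` are deliberately unsorted
early <- data.frame(A = c(1, 2, 3, 4, 11, 12, 13, 14),
                    Y = c(2, 4.1, 5.9, 8, 5, 5.6, 5.9, 6.5),
                    q = rep(c(1, 2), each = 4))
later <- data.frame(A = c(12, 2, 13, 3),
                    q = c(2, 1, 2, 1))

# One model per group, fitted on the early data
fits <- lapply(split(early, early$q), function(d) lm(Y ~ A, data = d))

# Predict per group, matching models and pieces by group name
pieces <- split(later, later$q)
preds <- lapply(names(pieces), function(k)
  unname(predict(fits[[k]], newdata = pieces[[k]])))

# unsplit() places each prediction back on the row it came from
later$ExpectedValue <- unsplit(preds, later$q)

# For comparison: unlist() returns the values grouped by q instead
grouped <- unlist(preds)
```

Here grouped lists both group-1 predictions first, while later$ExpectedValue follows the row order of later.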
