I am running a code to plot 10-fold cross-validation errors(something I learnt at an online Stanford course). The goal is to find out which of a variety of variable values (a polynomial of degrees 1 to 5) is the best fit for further out of sample predictions
The initial coding is:
require(ISLR)
require(boot)
cv.error10=rep(0,5)
for(d in degree){
glm.fit=glm(mpg~poly(horsepower,d), data=Auto)
cv.error10[d]=cv.glm(Auto,glm.fit,K=10)$delta[1]
}
plot(degree,cv.error10,type="b",col="red")
This works fine. Now I am trying to do the same for my data (I run a negative binomial)
glm.fit2<-glm(abs_pb_t ~ RI1 + RA1 + pacl_to + abs_pb + pbf3 + pbf32 + pbs + pbc + suc_inc + cdval + + cdval:RI1 + cdval:RA1 + pbf3:RI1 + pbf3:RA1 + pbf32:RI1 + pbf32:RA1 + Rfirm, data=p.data, family="quasipoisson")
cv.error10=rep(0,5)
for(d in degree){
glm.fit2=update(glm.fit2 , ~. + poly(yy,d), data=p.data, na.action="na.exclude")
cv.error10[d]=cv.glm(p.data,glm.fit2,K=10)$delta[1]
}
I added the exclusion of NA values because people had suggested this in other SO questions (here , here and here).
I get the following error:
Error in model.frame.default(formula = abs_pb_t ~ RI1 + RA1 + lnRnD + :
variable lengths differ (found for 'poly(yy, d)')
In my update formula the variable yy is a count variable that perfectly fits my data.frame (592 observations)
yy<-rep(seq(1:16),times=37) ; poly(yy,1) ; poly(yy,5)
According to the help file on "poly" missing values are not allowed so I do not understand why this variable would suddenly (by using the polynomial) generate missing values. I checked this and the polynomial does not create NA variables so something else must explain why I get this error.
Any ideas?
Thanks in advance