0

I am running a code to plot 10-fold cross-validation errors(something I learnt at an online Stanford course). The goal is to find out which of a variety of variable values (a polynomial of degrees 1 to 5) is the best fit for further out of sample predictions

The initial coding is:

require(ISLR)
require(boot)
cv.error10=rep(0,5)
for(d in degree){
  glm.fit=glm(mpg~poly(horsepower,d), data=Auto)
  cv.error10[d]=cv.glm(Auto,glm.fit,K=10)$delta[1]
}
plot(degree,cv.error10,type="b",col="red")

This works fine. Now I am trying to do the same for my data (I run a negative binomial)

glm.fit2<-glm(abs_pb_t ~ RI1 + RA1 + pacl_to + abs_pb + pbf3 + pbf32 + pbs + pbc + suc_inc + cdval +  + cdval:RI1 + cdval:RA1 + pbf3:RI1 + pbf3:RA1 + pbf32:RI1 + pbf32:RA1 + Rfirm, data=p.data, family="quasipoisson") 
cv.error10=rep(0,5)
for(d in degree){
  glm.fit2=update(glm.fit2 , ~. +  poly(yy,d), data=p.data, na.action="na.exclude")
  cv.error10[d]=cv.glm(p.data,glm.fit2,K=10)$delta[1]
}

I added the exclusion of NA values because people had suggested this in other SO questions (here , here and here).

I get the following error:

Error in model.frame.default(formula = abs_pb_t ~ RI1 + RA1 + lnRnD +  : 
  variable lengths differ (found for 'poly(yy, d)')

In my update formula the variable yy is a count variable that perfectly fits my data.frame (592 observations)

yy<-rep(seq(1:16),times=37) ; poly(yy,1) ; poly(yy,5)

According to the help file on "poly" missing values are not allowed so I do not understand why this variable would suddenly (by using the polynomial) generate missing values. I checked this and the polynomial does not create NA variables so something else must explain why I get this error.

Any ideas?

Thanks in advance

Community
  • 1
  • 1
SJDS
  • 1,239
  • 1
  • 16
  • 31
  • `na.exclude` causes the data to have fewer observations. I suggest you subset the rest of the data by `complete.cases(yy)` so they are the same length. – ilir Apr 22 '14 at 13:23
  • dim(complete.cases(p.data)) and length(poly(yy,d)) ... I'm confused it looks like poly(yy,d) is a list. How can you just add it to a glm? – Andrew Cassidy Apr 22 '14 at 13:32
  • Hi both, thanks for your replies. I initially did not add the na.exclude option. Without it the problem is the same. If I add @ilir 's suggestion then I get a new error (subscript out of bounds). @ Andrew, it indeed looks like poly(yy,d) is a list. However, it has exactly the same structure as the poly(horsepower,d) which does not cause a problem and that poly(yy,3) (or any other value but 3) can be used in regressions. It seems that only the matrix values in the generated list are actually used. – SJDS Apr 22 '14 at 16:48
  • Index out of bounds from `p.data[complete.cases(yy)]` can only be if `yy` is longer than `p.data` has rows. Make sure they are the same size. – ilir Apr 22 '14 at 16:51
  • @ilir thanks again for thinking with me but this is not the case. yy is exactly the same size in vector length. What I have come to believe in being faced with a completely different problem is that the fact that yy repeats for all firms is really what causes this problem. In other words, yy only varies over time but not over the different firms (Rfirm). I'm pretty sure that's the cause of the problem. Why this cause generates that specific error and how to solve it is not yet clear to me – SJDS Apr 23 '14 at 11:46

0 Answers0